SparkR(1) Naive Bayesian

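1. Naive Bayesian

Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)

Features - F1, F2, …, Fn

Categories - C1, C2, …, Cm

P(C|F1F2…Fn) = P(F1F2…Fn|C) P(C) / P(F1F2…Fn)

Assuming the features are independent of each other given the category (the "naive" assumption):

P(F1F2…Fn|C) P(C) = P(F1|C) P(F2|C) … P(Fn|C) P(C)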
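The denominator P(F1F2…Fn) is the same for every category, so classification only needs to compare the numerators. A sketch of the resulting decision rule, where \hat{C} is notation introduced here for the predicted category:

```latex
\hat{C} = \arg\max_{C \in \{C_1, \dots, C_m\}} \; P(C) \prod_{i=1}^{n} P(F_i \mid C)
```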

2. Prepare the Environment

spark-1.4.1

I just downloaded the latest version and placed it in my classpath

http://mirror.nexcess.net/apache//spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz

R-3.2.2

> r --version

R version 3.2.2 (2015-08-14) -- "Fire Safety"

Copyright (C) 2015 The R Foundation for Statistical Computing

Platform: x86_64-apple-darwin13.4.0 (64-bit)

RStudio version 0.99.473

3. Start Spark with the R Shell

> bin/sparkR --master local[2]

Then we can paste whatever we want directly into the shell, for instance the code from this example

https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R

4. Execute an R Script in SparkR

https://github.com/math-and-data/SparkR/blob/master/Demo_of_SparkR.Rmd

https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R

https://github.com/apache/spark/blob/master/examples/src/main/resources/people.json

> bin/spark-submit examples/src/main/r/dataframe.R

5. Run the R Code in RStudio

Install JDK 1.6 on my Mac

https://support.apple.com/kb/DL1572?locale=en_US

The file I downloaded is from here.

http://supportdownload.apple.com/download.info.apple.com/Apple_Support_Area/Apple_Software_Updates/Mac_OS_X/downloads/031-29055.20150831-0f779fb2-4bf4-11e5-a8d8-/javaforosx.dmg

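Move the Spark binary package to /opt/spark (the code below expects the unpacked distribution there, for example /opt/spark/R/lib)

> mv spark-1.4.1-bin-hadoop2.6.tgz /opt/spark/

This sample R code can then be run in RStudio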

## download all the related packages

mypkgs <- c("dplyr", "ggplot2", "magrittr")

install.packages(mypkgs)

Sys.setenv(JAVA_HOME = "/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home") # JDK path on my Mac

library("rJava")

mySparkRpackagepath <- "/opt/spark/spark-1.4.1-bin-hadoop2.6.tgz"

install.packages(mySparkRpackagepath)

## load SparkR from the lib directory inside the Spark distribution

library("SparkR", lib.loc = "/opt/spark/R/lib")

library("SparkR")

Sys.setenv(SPARK_HOME = "/opt/spark")

## note: the argument is appName, with a capital N

sc <- sparkR.init(master = "local", appName = "SparkR_demo_RTA",
                  sparkHome = "/opt/spark")

sqlContext <- sparkRSQL.init(sc)

hiveContext <- sparkRHive.init(sc)

## read the sample JSON file that ships with Spark

path <- file.path(Sys.getenv("SPARK_HOME"),
                  "examples/src/main/resources/people.json")

peopleDF <- jsonFile(sqlContext, path)

printSchema(peopleDF)

head(peopleDF)

6. Further Examples

https://github.com/kiendang/sparkr-naivebayes-example

http://www.slideshare.net/KienDang5/introduction-to-sparkr

Data Types of the R Language

Vector

> c(1, 2, 3, 4)

[1] 1 2 3 4

> 1:4

[1] 1 2 3 4

> c("a", "b", "c")

[1] "a" "b" "c"

> c(T, F, T)

[1]  TRUE FALSE  TRUE

Matrix

> matrix(c(1, 2, 3, 4), ncol = 2)

     [,1] [,2]

[1,]    1    3

[2,]    2    4

> matrix(c(1, 2, 3, 4), ncol = 2, byrow = T)

     [,1] [,2]

[1,]    1    2

[2,]    3    4
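The two calls above differ only in fill order. A small sketch that makes this explicit; the variable names are chosen here for illustration:

```r
# Same values, two fill orders.
m_col <- matrix(c(1, 2, 3, 4), ncol = 2)                # filled column by column (default)
m_row <- matrix(c(1, 2, 3, 4), ncol = 2, byrow = TRUE)  # filled row by row

# Transposing the column-filled matrix gives the row-filled one.
same <- identical(t(m_col), m_row)

# Elements are addressed as [row, column].
top_right <- m_col[1, 2]

print(dim(m_col))
print(same)
print(top_right)
```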

List

> list(12, "twelve")

[[1]]

[1] 12

[[2]]

[1] "twelve"

> list(1, 2, 3)

[[1]]

[1] 1

[[2]]

[1] 2

[[3]]

[1] 3
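Unlike a vector, a list can hold elements of different types. A short sketch of how elements are pulled back out; the element names `value` and `label` are made up for this example:

```r
# A list can mix types, and its elements can be named.
item <- list(value = 12, label = "twelve")

# [[ ]] or $ extracts the element itself; [ ] returns a sub-list.
v   <- item[["value"]]
sub <- item["value"]

print(class(v))    # the element's own type
print(class(sub))  # still a list
```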

Data frame

> name <- c("A", "B", "C")

> age <- c(30, 17, 42)

> male <- c(T, F, F)

> data.frame(name, age, male)

  name age  male

1    A  30  TRUE

2    B  17 FALSE

3    C  42 FALSE
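A data frame is the local counterpart of the Spark DataFrame used earlier, so it is worth seeing how columns and rows are selected. A minimal sketch that rebuilds the name/age/male data frame from above; the age threshold of 18 is an arbitrary choice for illustration:

```r
# Rebuild the data frame from the example above.
people <- data.frame(
  name = c("A", "B", "C"),
  age  = c(30, 17, 42),
  male = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# A single column is an ordinary vector.
ages <- people$age

# Rows are selected with a logical vector; here, everyone aged 18 or over.
adults <- people[people$age >= 18, ]

# str() shows the type of each column.
str(people)
print(adults)
```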

runif(n, min = 0, max = 1) draws n random numbers from the uniform distribution between min and max

> x <- 1:100

> y <- 1:100 + runif(100, 0, 20)

> m <- lm(y ~ x)

> plot(y ~ x)

> abline(m$coefficients)
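To check what the fitted line should look like, here is a sketch of the same regression without the plot. The fixed seed is an addition so that the run is repeatable; it is not part of the original example:

```r
set.seed(42)  # assumption: fixed seed, only so that results are repeatable

x <- 1:100
y <- 1:100 + runif(100, 0, 20)  # noise is uniform on [0, 20], so its mean is 10

m <- lm(y ~ x)

# coef(m) returns a named vector with entries "(Intercept)" and "x".
fit <- coef(m)
print(fit)

# The noise averages 10 and does not depend on x, so the intercept should
# land near 10 and the slope near 1.
```

The intercept absorbs the mean of the noise, which is why the fitted line sits about 10 units above y = x.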

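R is single-threaded and can only process data sets that fit on a single machine.

SparkR allows users to interactively run jobs from the R shell on a cluster.

Famous Word Count Example

Start the shell. Note that the snippet below maps nchar over each line, so it returns the number of characters per line rather than word counts.

> bin/sparkR --master local[2]

> rdd <- SparkR:::textFile(sc, 'README.md')

> counts <- SparkR:::map(rdd, nchar)

> SparkR:::take(counts, 3)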

[[1]]

[1] 14

[[2]]

[1] 0

[[3]]

[1] 78
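For comparison, a minimal local sketch of an actual word count in base R. It runs on a single machine without SparkR, and the `lines` vector is made-up sample input rather than the real README.md:

```r
# Hypothetical sample input standing in for the lines of a text file.
lines <- c("Apache Spark",
           "",
           "Spark is a fast and general cluster computing system")

# Characters per line: the same quantity that mapping nchar over lines gives.
chars_per_line <- nchar(lines)

# Split each line on whitespace, then drop any empty strings.
words <- unlist(strsplit(lines, "[[:space:]]+"))
words <- words[words != ""]

# table() counts the occurrences of each distinct word.
counts <- table(words)
print(sort(counts, decreasing = TRUE))
```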

Supervised machine learning: Naive Bayes classifies texts based on word frequency.
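A minimal base-R sketch of that idea, not the code from the linked repository. The training texts and labels are made up, and the add-one (Laplace) smoothing is an assumption added here so that unseen word/category pairs do not get probability zero:

```r
# Made-up training data: each text has a category label.
train_texts  <- c("cheap pills buy now", "buy cheap watches",
                  "meeting at noon", "lunch meeting tomorrow")
train_labels <- c("spam", "spam", "ham", "ham")

tokenize <- function(text) strsplit(tolower(text), "[[:space:]]+")[[1]]

vocab   <- unique(unlist(lapply(train_texts, tokenize)))
classes <- unique(train_labels)

# log P(C): the share of training texts in each category.
log_prior <- sapply(classes, function(cl) log(mean(train_labels == cl)))

# log P(word | C) from word frequencies, with add-one smoothing (assumption).
# Result: one row per vocabulary word, one column per category.
log_lik <- sapply(classes, function(cl) {
  words  <- unlist(lapply(train_texts[train_labels == cl], tokenize))
  counts <- as.vector(table(factor(words, levels = vocab)))
  log((counts + 1) / (sum(counts) + length(vocab)))
})
rownames(log_lik) <- vocab

# Score each category as log P(C) + sum of log P(word | C), pick the largest.
classify <- function(text) {
  words  <- tokenize(text)
  words  <- words[words %in% vocab]  # ignore words never seen in training
  scores <- log_prior + colSums(log_lik[words, , drop = FALSE])
  names(which.max(scores))
}

print(classify("buy cheap pills"))
print(classify("meeting tomorrow"))
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on longer texts.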

References:

http://www.iteblog.com/archives/1385

http://spark.apache.org/docs/latest/sparkr.html

https://github.com/math-and-data/SparkR/blob/master/Demo_of_SparkR.Rmd

https://github.com/BIDS/sparkR-demo

http://ampcamp.berkeley.edu/5/exercises/sparkr.html

https://github.com/kiendang/sparkr-naivebayes-example

Naive Bayesian

http://www.cnblogs.com/leoo2sk/archive/2010/09/17/1829190.html

http://www.ruanyifeng.com/blog/2013/12/naive_bayes_classifier.html

Algorithm

http://www.ruanyifeng.com/blog/algorithm/