SparkR (1) Naive Bayesian
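1. Naive Bayesian
Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)
Features - F1, F2, … Fn
Category - C1, C2, … Cm
P(C | F1 F2 … Fn) = P(F1 F2 … Fn | C) P(C) / P(F1 F2 … Fn)
The "naive" part is the assumption that the features are independent of each other given the category, so the numerator factorizes:
P(F1 F2 … Fn | C) P(C) = P(F1|C) P(F2|C) … P(Fn|C) P(C)
The denominator P(F1 F2 … Fn) is the same for every category, so the predicted category is simply the one with the largest numerator.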
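To make the formulas concrete, here is a minimal sketch in base R. The toy training table, its values, and the helper name naive_bayes_posterior are all made up for illustration; none of this comes from SparkR or any other package.

```r
# Hypothetical toy training data: two categorical features and a category.
train <- data.frame(
  F1 = c("sunny", "sunny", "rainy", "rainy", "sunny", "rainy"),
  F2 = c("hot",   "cold",  "cold",  "hot",   "hot",   "cold"),
  C  = c("out",   "out",   "home",  "home",  "out",   "out"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: estimate P(C | F1, F2) for every category from counts,
# using P(F1 | C) * P(F2 | C) * P(C) as the unnormalised score.
naive_bayes_posterior <- function(train, f1, f2) {
  categories <- unique(train$C)
  scores <- sapply(categories, function(k) {
    rows  <- train[train$C == k, ]
    prior <- nrow(rows) / nrow(train)   # P(C)
    p_f1  <- mean(rows$F1 == f1)        # P(F1 | C)
    p_f2  <- mean(rows$F2 == f2)        # P(F2 | C)
    p_f1 * p_f2 * prior
  })
  # Normalise so the scores sum to 1; the sum plays the role of P(F1 F2).
  scores / sum(scores)
}

posterior <- naive_bayes_posterior(train, "rainy", "cold")
print(posterior)
```

For the observation (rainy, cold) the scores are 1/12 for "out" and 1/6 for "home", so after normalising "home" gets probability 2/3 and is the predicted category.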
2. Prepare the Environment
spark-1.4.1
I just downloaded the latest version and placed it in my classpath.
http://mirror.nexcess.net/apache//spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz
R-3.2.2
> R --version
R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)
RStudio version 0.99.473
3. Start Spark with the R Shell
> bin/sparkR --master local[2]
We can then paste whatever we want directly into the shell, for instance the code from this example:
https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R
4. Execute an R Script in SparkR
https://github.com/math-and-data/SparkR/blob/master/Demo_of_SparkR.Rmd
https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R
https://github.com/apache/spark/blob/master/examples/src/main/resources/people.json
> bin/spark-submit examples/src/main/r/dataframe.R
5. Run the R Code in RStudio
Install JDK 1.6 on my Mac
https://support.apple.com/kb/DL1572?locale=en_US
The file I downloaded is from here:
http://supportdownload.apple.com/download.info.apple.com/Apple_Support_Area/Apple_Software_Updates/Mac_OS_X/downloads/031-29055.20150831-0f779fb2-4bf4-11e5-a8d8-/javaforosx.dmg
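Move the binary Spark file to /opt/spark
> mv spark-1.4.1-bin-hadoop2.6.tgz /opt/spark/
This sample R code can then be run in RStudio. It treats /opt/spark as the Spark home, so the archive has to be unpacked there first.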
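## Download all the related packages (rJava is loaded below, so install it too)
mypkgs <- c("dplyr", "ggplot2", "magrittr", "rJava")
install.packages(mypkgs)
## Point R at the JDK (this is the JDK 1.6 path on my Mac)
Sys.setenv(JAVA_HOME = "/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home")
library("rJava")
## Assumes the Spark archive has been unpacked so that /opt/spark is the Spark home
Sys.setenv(SPARK_HOME = "/opt/spark")
## SparkR ships inside the Spark distribution under R/lib, so it is loaded
## from there; the Spark .tgz is not an R package and cannot be installed
## with install.packages()
library("SparkR", lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
## Start a local Spark context and the SQL / Hive contexts on top of it
sc <- sparkR.init(master = "local", appName = "SparkR_demo_RTA",
                  sparkHome = Sys.getenv("SPARK_HOME"))
sqlContext <- sparkRSQL.init(sc)
hiveContext <- sparkRHive.init(sc)
## Load the sample JSON file that ships with Spark into a DataFrame
path <- file.path(Sys.getenv("SPARK_HOME"),
                  "examples/src/main/resources/people.json")
peopleDF <- jsonFile(sqlContext, path)
printSchema(peopleDF)
head(peopleDF)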
6. Further Example
https://github.com/kiendang/sparkr-naivebayes-example
http://www.slideshare.net/KienDang5/introduction-to-sparkr
Data Types of the R Language
Vector
> c(1, 2, 3, 4)
[1] 1 2 3 4
> 1:4
[1] 1 2 3 4
> c("a", "b", "c")
[1] "a" "b" "c"
> c(T, F, T)
[1]  TRUE FALSE  TRUE
Matrix
> matrix(c(1, 2, 3, 4), ncol = 2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> matrix(c(1, 2, 3, 4), ncol = 2, byrow = T)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
List
> list(12, "twelve")
[[1]]
[1] 12
[[2]]
[1] "twelve"
> list(1, 2, 3)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
Data frame
> name <- c("A", "B", "C")
> age <- c(30, 17, 42)
> male <- c(T, F, F)
> data.frame(name, age, male)
  name age  male
1    A  30  TRUE
2    B  17 FALSE
3    C  42 FALSE
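runif(n, min = 0, max = 1) draws n random values from the uniform distribution between min and max. Here it adds noise to a straight line, and lm fits a linear model to the result:
> x <- 1:100
> y <- 1:100 + runif(100, 0, 20)
> m <- lm(y ~ x)
> plot(y ~ x)
> abline(m$coefficients)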
R is single-threaded and can only process data sets that fit on a single machine.
SparkR allows users to interactively run jobs from the R shell on a cluster.
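Famous Word Count Example
Start the shell
> bin/sparkR --master local[2]
Then read README.md as an RDD. The RDD functions are private in SparkR 1.4, hence the ::: prefix. Note that this snippet maps nchar over the lines, so it returns the number of characters in each of the first three lines, not yet a count per word.
> rdd <- SparkR:::textFile(sc, 'README.md')
> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)
[[1]]
[1] 14
[[2]]
[1] 0
[[3]]
[1] 78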
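To get counts per word, the same split / group / count idea can be sketched locally in plain base R. This is only a single-machine analogue for illustration, not SparkR API, and the input lines are made up; in the SparkR shell they would come from README.md.

```r
# Hypothetical input lines standing in for the contents of a text file.
lines <- c("spark makes big data simple",
           "",
           "r and spark work together in spark r")

# "flatMap"-like step: split every line on whitespace and flatten the result.
words <- unlist(strsplit(lines, "[[:space:]]+"))
words <- words[words != ""]   # drop empty strings that blank lines may produce

# "reduceByKey"-like step: count the occurrences of each distinct word.
counts <- table(words)

# Show the most frequent words first.
print(sort(counts, decreasing = TRUE))
```

On a cluster the splitting and the counting would each run per partition before the partial counts are merged; table() does both in one step here because everything fits in memory.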
Naive Bayes is a supervised machine learning method. Here it classifies texts based on word frequency.
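A minimal sketch of such a word-frequency classifier in base R follows. The training texts, the labels, and the helper names tokenize and classify are hypothetical, and add-one smoothing is an assumption added here so that a word never seen in a category does not force its probability to zero.

```r
# Hypothetical labelled training texts.
docs   <- c("cheap pills buy now", "buy cheap watches",
            "meeting schedule for monday", "project meeting notes")
labels <- c("spam", "spam", "ham", "ham")

# Hypothetical helper: lower-case a text and split it into words.
tokenize <- function(text) strsplit(tolower(text), "[[:space:]]+")[[1]]

# All distinct words seen in training.
vocab <- unique(unlist(lapply(docs, tokenize)))

# Hypothetical helper: score each category with
# log P(C) + sum of log P(word | C) and return the best category.
classify <- function(text) {
  tokens <- tokenize(text)
  tokens <- tokens[tokens %in% vocab]   # ignore words never seen in training
  scores <- sapply(unique(labels), function(k) {
    class_words <- unlist(lapply(docs[labels == k], tokenize))
    freq <- table(factor(class_words, levels = vocab))  # word counts in k
    log_prior <- log(mean(labels == k))                 # log P(C)
    # Add-one smoothing (assumption): (count + 1) / (total + vocabulary size)
    log_lik <- sum(log((as.numeric(freq[tokens]) + 1) /
                       (sum(freq) + length(vocab))))
    log_prior + log_lik
  })
  names(which.max(scores))
}

print(classify("buy cheap pills"))
print(classify("monday meeting"))
```

Working with logarithms turns the product of many small probabilities into a sum, which avoids numerical underflow on longer texts.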
References:
http://www.iteblog.com/archives/1385
http://spark.apache.org/docs/latest/sparkr.html
https://github.com/math-and-data/SparkR/blob/master/Demo_of_SparkR.Rmd
https://github.com/BIDS/sparkR-demo
http://ampcamp.berkeley.edu/5/exercises/sparkr.html
https://github.com/kiendang/sparkr-naivebayes-example
Naive Bayesian
http://www.cnblogs.com/leoo2sk/archive/2010/09/17/1829190.html
http://www.ruanyifeng.com/blog/2013/12/naive_bayes_classifier.html
Algorithm
http://www.ruanyifeng.com/blog/algorithm/