Prediction(6) PyLib and Machine Learning

1. Introduction

An ensemble method creates a model composed of a set of other base models. Gradient boosted trees (GBTs) and Random Forests both use decision trees as their base models.

GBTs train one tree at a time, so they can take longer to train than Random Forests, which train their trees in parallel. On the other hand, GBTs typically use smaller trees than Random Forests.

Training more trees in a Random Forest reduces the likelihood of overfitting, while training more trees with GBTs increases the likelihood of overfitting.

Random Forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.
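To make the contrast concrete, here is a minimal sketch (not part of the original notes; the toy data and parameter values are only illustrative) that trains both kinds of ensembles on the same LabeledPoint RDD with MLlib. For the Random Forest, numTrees controls how many independent trees are grown; for GBTs, numIterations controls how many trees are fitted one after another.

%pyspark
# Sketch only: toy data and parameters are illustrative, not tuned.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest, GradientBoostedTrees

toy = sc.parallelize([LabeledPoint(0.0, [0.0]), LabeledPoint(0.0, [0.2]),
                      LabeledPoint(1.0, [0.8]), LabeledPoint(1.0, [1.0])])

# Random Forest: trees are built independently; more trees mainly reduce variance.
rfModel = RandomForest.trainRegressor(toy, categoricalFeaturesInfo={},
                                      numTrees=10, maxDepth=4)

# GBTs: trees are built sequentially; more iterations reduce bias but can overfit.
gbtModel = GradientBoostedTrees.trainRegressor(toy, categoricalFeaturesInfo={},
                                               numIterations=10, maxDepth=4)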

2. Try with Random Forests

Error Message in Zeppelin:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark.py", line 162, in <module>
    eval(compiledCode)
  File "<string>", line 1, in <module>
  File "/opt/spark/python/pyspark/mllib/__init__.py", line 25, in <module>
    import numpy
ImportError: No module named numpy

Solution:

http://stackoverflow.com/questions/7818811/import-error-no-module-named-numpy

Download the latest file from http://sourceforge.net/projects/numpy/files/NumPy/

> wget http://tcpdiag.dl.sourceforge.net/project/numpy/NumPy/1.10.0/numpy-1.10.0.tar.gz

> tar zxvf numpy-1.10.0.tar.gz

> cd numpy-1.10.0

> sudo python setup.py install

Verify the installation

> python

>>> import numpy

>>> exit()
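Once numpy is installed for the same Python the Spark workers use, a quick check from Zeppelin (a sketch; it assumes the %pyspark interpreter is already working, which is covered below) should print a version string instead of the ImportError above.

%pyspark
# Should print e.g. 1.10.0 instead of "ImportError: No module named numpy".
import numpy
print(numpy.__version__)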

Error Message in Zeppelin Logs

ERROR [2015-10-06 14:14:40,447] ({qtp1852584274-48} NotebookServer.java[runParagraph]:630) - Exception from run
org.apache.zeppelin.interpreter.InterpreterException: pyspark interpreter not found
        at org.apache.zeppelin.notebook.NoteInterpreterLoader.get(NoteInterpreterLoader.java:148)
        at org.apache.zeppelin.notebook.Note.run(Note.java:282)
        at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:628)
        at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)

Solution:

http://blog.cloudera.com/blog/2015/07/how-to-install-apache-zeppelin-on-cdh/

> mvn clean package -Pspark-1.5 -Dpyspark -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests -Pbuild-distr

Try this code in Zeppelin:

%pyspark

sc.parallelize([1,2,3]).count()

Exception:

Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /home/carl/tool/hadoop-2.7.1/temp/nm-local-dir/usercache/carl/filecache/20/spark-assembly-1.5.0-hadoop2.6.0.jar
java.io.EOFException

Solution:

http://stackoverflow.com/questions/30824818/what-to-set-spark-home-to

Add this to the Zeppelin configuration file.

export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"

This should be all that is needed, but the VMs are slow, so I did not get it working perfectly. I may try this again with a later version.
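When the %pyspark interpreter does come up, a quick check (just a sketch; the printed locations will differ per install) is to confirm that the pyspark and py4j modules resolve from the paths added to PYTHONPATH above.

%pyspark
# If PYTHONPATH is picked up, both modules should resolve from under $SPARK_HOME.
import pyspark
import py4j
print(pyspark.__file__)
print(py4j.__file__)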

3. Set up Single Mode

Only these configurations are needed for Zeppelin in local mode:

export SPARK_HOME="/opt/spark"

. ${SPARK_HOME}/conf/spark-env.sh

export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"

export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"

Single mode works great for me, and the speed is also much better than in the VMs.
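A small sanity check I run after this kind of setup (a sketch, not from the original notes) is to confirm which Spark the %pyspark interpreter is bound to and that a trivial job actually runs.

%pyspark
# Print the Spark version and master, then run a trivial job (should print 4950).
print(sc.version)
print(sc.master)
print(sc.parallelize(range(100)).sum())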

4. Random Forest Sample on Zeppelin

%pyspark
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

%pyspark
# Load LibSVM data and split it into training (70%) and test (30%) sets.
data = MLUtils.loadLibSVMFile(sc, "/opt/spark/data/mllib/sample_libsvm_data.txt")
(trainingData, testData) = data.randomSplit([0.7, 0.3])

%pyspark
# Train a RandomForest regression model with 3 trees.
# The empty categoricalFeaturesInfo means all features are treated as continuous.
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)

%pyspark
# Evaluate the model on the test set and compute the mean squared error.
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testMSE = labelsAndPredictions.map(lambda (v, p): (v - p) * (v - p)).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())
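As an optional follow-up (not in the original sample; the output path below is just a placeholder), MLlib can also persist the trained forest and load it back later.

%pyspark
# Save the trained model and reload it; the path is only a placeholder
# and must not already exist.
model.save(sc, "/tmp/randomForestRegressionModel")
sameModel = RandomForestModel.load(sc, "/tmp/randomForestRegressionModel")
print(sameModel.numTrees())
print(sameModel.totalNumNodes())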

References:

http://spark.apache.org/docs/latest/mllib-ensembles.html

Setup Zeppelin Again with Python
