Prediction(6) PyLib and Machine Learning
1. Introduction
An ensemble method creates a model composed of a set of other base models. Gradient-Boosted Trees (GBTs) and Random Forests both use decision trees as their base models.
GBTs train one tree at a time, so they can take longer to train than Random Forests, which can build many trees in parallel (GBTs also typically use smaller trees).
Training more trees in a Random Forest reduces the likelihood of overfitting, while training more trees with GBTs increases the likelihood of overfitting.
Random Forests reduce variance by using more trees; GBTs reduce bias by using more trees.
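Both learners live in pyspark.mllib.tree, so the two training APIs can be compared side by side. The sketch below is only an illustration, assuming the sample LibSVM data that ships with Spark (the same file used in section 4); GradientBoostedTrees is not used elsewhere in this post.
%pyspark
from pyspark.mllib.tree import RandomForest, GradientBoostedTrees
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "/opt/spark/data/mllib/sample_libsvm_data.txt")

# Random Forest: trees are built independently (in parallel); more trees mainly reduce variance
rfModel = RandomForest.trainRegressor(data, categoricalFeaturesInfo={},
                                      numTrees=10, maxDepth=4)

# GBTs: trees are built one after another, each correcting the previous ensemble, which reduces bias
gbtModel = GradientBoostedTrees.trainRegressor(data, categoricalFeaturesInfo={},
                                               numIterations=10, maxDepth=4)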
2. Try with Random Forests
Error Message in Zeppelin:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark.py", line 162, in <module>
    eval(compiledCode)
  File "<string>", line 1, in <module>
  File "/opt/spark/python/pyspark/mllib/__init__.py", line 25, in <module>
    import numpy
ImportError: No module named numpy
Solution:
http://stackoverflow.com/questions/7818811/import-error-no-module-named-numpy
Download the latest file from http://sourceforge.net/projects/numpy/files/NumPy/
>wget http://tcpdiag.dl.sourceforge.net/project/numpy/NumPy/1.10.0/numpy-1.10.0.tar.gz
>tar zxvf numpy-1.10.0.tar.gz
>cd numpy-1.10.0
>sudo python setup.py install
Verify the installation:
>python
>>> import numpy
>>> exit()
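Since MLlib also calls numpy inside the executors, it is worth confirming that the workers can import it too. A minimal sketch for a %pyspark paragraph (nothing here is specific to this setup):
%pyspark
import numpy
# driver-side version
print(numpy.__version__)
# each task imports numpy on the worker side and reports its version
print(sc.parallelize(range(2)).map(lambda x: __import__('numpy').__version__).collect())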
Error Message in Zeppelin Logs:
ERROR [2015-10-06 14:14:40,447] ({qtp1852584274-48} NotebookServer.java[runParagraph]:630) - Exception from run
org.apache.zeppelin.interpreter.InterpreterException: pyspark interpreter not found
at org.apache.zeppelin.notebook.NoteInterpreterLoader.get(NoteInterpreterLoader.java:148)
at org.apache.zeppelin.notebook.Note.run(Note.java:282)
at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:628)
at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)
Solution:
http://blog.cloudera.com/blog/2015/07/how-to-install-apache-zeppelin-on-cdh/
>mvn clean package -Pspark-1.5 -Dpyspark -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests -Pbuild-distr
Try this code in Zeppelin:
%pyspark
sc.parallelize([1, 2, 3]).count()
Exception:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /home/carl/tool/hadoop-2.7.1/temp/nm-local-dir/usercache/carl/filecache/20/spark-assembly-1.5.0-hadoop2.6.0.jar
java.io.EOFException
Solution:
http://stackoverflow.com/questions/30824818/what-to-set-spark-home-to
Add this to the Zeppelin configuration file:
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"
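To check whether the pyspark interpreter actually picks up the new PYTHONPATH, a quick paragraph like the following can help (a minimal sketch using only the standard sys and os modules):
%pyspark
import sys, os
print(sys.executable)                  # which python binary the interpreter runs
print(os.environ.get("SPARK_HOME"))    # should point at the Spark installation
print([p for p in sys.path if "py4j" in p or "spark" in p])  # the pyspark and py4j entries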
With that setting it should work, but the VMs are slow, so I did not get it working perfectly. I may try this again in a later version.
3. Set up Single Mode
Only these configurations are needed for Zeppelin in local mode:
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"
The single mode is working great for me, and the speed is also much better than in the VMs.
4. Random Forest Sample on Zeppelin
%pyspark
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
%pyspark
data = MLUtils.loadLibSVMFile(sc, "/opt/spark/data/mllib/sample_libsvm_data.txt")
(trainingData, testData) = data.randomSplit([0.7, 0.3])
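Optionally, sanity-check the split before training; the exact counts will vary from run to run because randomSplit is random:
%pyspark
# quick look at the 70/30 split and one sample record
print(data.count(), trainingData.count(), testData.count())
print(trainingData.first())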
%pyspark
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)
%pyspark
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testMSE = labelsAndPredictions.map(lambda (v, p): (v - p) * (v - p)).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())
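The RandomForestModel import above also allows the trained forest to be persisted and loaded back later; a minimal sketch, where the target path is just an example:
%pyspark
# save the trained forest and load it back (the path is only an example)
model.save(sc, "/tmp/rf-regression-model")
sameModel = RandomForestModel.load(sc, "/tmp/rf-regression-model")
print(sameModel.numTrees())
print(sameModel.totalNumNodes())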
References:
http://spark.apache.org/docs/latest/mllib-ensembles.html
Set up Zeppelin Again with Python