Prediction(6) PyLib and Machine Learning
1. Introduction
An ensemble method creates a model composed of a set of other base models. Gradient-Boosted Trees (GBTs) and Random Forests both use decision trees as their base models.
GBTs train one tree at a time, so they can take longer to train than Random Forests, which can build many trees in parallel (GBTs also typically use smaller trees).
Training more trees in a Random Forest reduces the likelihood of overfitting, while training more trees with GBTs increases the likelihood of overfitting.
Random Forests reduce variance by using more trees; GBTs reduce bias by using more trees.
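Both learners live in pyspark.mllib.tree, so the two training APIs can be compared side by side. The sketch below is only an illustration, assuming the sample LibSVM data that ships with Spark (the same file used in section 4); GradientBoostedTrees is not used elsewhere in this post.
%pyspark
from pyspark.mllib.tree import RandomForest, GradientBoostedTrees
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "/opt/spark/data/mllib/sample_libsvm_data.txt")

# Random Forest: trees are built independently (in parallel); more trees mainly reduce variance
rfModel = RandomForest.trainRegressor(data, categoricalFeaturesInfo={},
                                      numTrees=10, maxDepth=4)

# GBTs: trees are built one after another, each correcting the previous ensemble, which reduces bias
gbtModel = GradientBoostedTrees.trainRegressor(data, categoricalFeaturesInfo={},
                                               numIterations=10, maxDepth=4)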
2. Try with Random Forests
Error Message in Zeppelin:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark.py", line 162, in <module>
    eval(compiledCode)
  File "<string>", line 1, in <module>
  File "/opt/spark/python/pyspark/mllib/__init__.py", line 25, in <module>
    import numpy
ImportError: No module named numpy
Solution:
http://stackoverflow.com/questions/7818811/import-error-no-module-named-numpy
Download the latest file from http://sourceforge.net/projects/numpy/files/NumPy/
>wget http://tcpdiag.dl.sourceforge.net/project/numpy/NumPy/1.10.0/numpy-1.10.0.tar.gz
>tar zxvf numpy-1.10.0.tar.gz
>cd numpy-1.10.0
>sudo python setup.py install
Verify the installation:
>python
>>> import numpy
>>> exit()
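Since MLlib also calls numpy inside the executors, it is worth confirming that the workers can import it too. A minimal sketch for a %pyspark paragraph (nothing here is specific to this setup):
%pyspark
import numpy
# driver-side version
print(numpy.__version__)
# each task imports numpy on the worker side and reports its version
print(sc.parallelize(range(2)).map(lambda x: __import__('numpy').__version__).collect())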
Error Message in Zeppelin Logs:
ERROR [2015-10-06 14:14:40,447] ({qtp1852584274-48} NotebookServer.java[runParagraph]:630) - Exception from run
org.apache.zeppelin.interpreter.InterpreterException: pyspark interpreter not found
at org.apache.zeppelin.notebook.NoteInterpreterLoader.get(NoteInterpreterLoader.java:148)
at org.apache.zeppelin.notebook.Note.run(Note.java:282)
at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:628)
at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)
Solution:
http://blog.cloudera.com/blog/2015/07/how-to-install-apache-zeppelin-on-cdh/
>mvn clean package -Pspark-1.5 -Dpyspark -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests -Pbuild-distr
Try this code in Zeppelin:
%pyspark
sc.parallelize([1, 2, 3]).count()
Exception:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /home/carl/tool/hadoop-2.7.1/temp/nm-local-dir/usercache/carl/filecache/20/spark-assembly-1.5.0-hadoop2.6.0.jar
java.io.EOFException
Solution:
http://stackoverflow.com/questions/30824818/what-to-set-spark-home-to
Add this to the Zeppelin configuration file:
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"
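To check whether the pyspark interpreter actually picks up the new PYTHONPATH, a quick paragraph like the following can help (a minimal sketch using only the standard sys and os modules):
%pyspark
import sys, os
print(sys.executable)                  # which python binary the interpreter runs
print(os.environ.get("SPARK_HOME"))    # should point at the Spark installation
print([p for p in sys.path if "py4j" in p or "spark" in p])  # the pyspark and py4j entries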
With that setting it should work, but the VMs are slow, so I did not get it working perfectly. I may try this again in a later version.
3. Set up Single Mode
Only these configurations are needed for Zeppelin in local mode:
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"
The single mode is working great for me, and the speed is also much better than in the VMs.
4. Random Forest Sample on Zeppelin
%pyspark
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
%pyspark
data = MLUtils.loadLibSVMFile(sc, "/opt/spark/data/mllib/sample_libsvm_data.txt")
(trainingData, testData) = data.randomSplit([0.7, 0.3])
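Optionally, sanity-check the split before training; the exact counts will vary from run to run because randomSplit is random:
%pyspark
# quick look at the 70/30 split and one sample record
print(data.count(), trainingData.count(), testData.count())
print(trainingData.first())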
%pyspark
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)
%pyspark
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testMSE = labelsAndPredictions.map(lambda (v, p): (v - p) * (v - p)).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())
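The RandomForestModel import above also allows the trained forest to be persisted and loaded back later; a minimal sketch, where the target path is just an example:
%pyspark
# save the trained forest and load it back (the path is only an example)
model.save(sc, "/tmp/rf-regression-model")
sameModel = RandomForestModel.load(sc, "/tmp/rf-regression-model")
print(sameModel.numTrees())
print(sameModel.totalNumNodes())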
References:
http://spark.apache.org/docs/latest/mllib-ensembles.html
Set up Zeppelin Again with Python