Classification (3): Generate Features and Stem, Adjust the Model System

1. Scala Operation

String Method - contains

scala> val longContent = "carl love to study python, scala"
longContent: String = carl love to study python, scala

scala> longContent.contains("python")
res0: Boolean = true

Map Merge Function

Start the sbt console directly under the project that already has the jar dependencies.

> sbt console

scala> import scalaz.Scalaz._
import scalaz.Scalaz._

scala> val m1 = Map("0" -> 0, "1" -> 1)
m1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1)

scala> val m2 = Map("2" -> 2)
m2: scala.collection.immutable.Map[String,Int] = Map(2 -> 2)

scala> val m3 = m1 |+| m2
m3: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
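
In the session above the two maps share no keys, so the merge looks like a plain union. With scalaz, `|+|` on a `Map[String, Int]` also combines the values of keys that appear in both maps, which for `Int` means addition. The following is a minimal sketch of the same behavior using only the standard library; `MapMerge` and `merge` are hypothetical names chosen for illustration, not part of scalaz.

```scala
object MapMerge {
  // Merge two maps; when a key appears in both, add the two values
  // (this mirrors what scalaz's |+| does for Map[String, Int]).
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0) + value)
    }
}
```

For example, `MapMerge.merge(Map("a" -> 1, "b" -> 2), Map("b" -> 3))` gives a map where `a` is 1 and `b` is 5.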

Map Operation

scala> m3
res1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)

scala> m3 - "0"
res2: scala.collection.immutable.Map[String,Int] = Map(1 -> 1, 2 -> 2)

Magic scalaz

https://github.com/scalaz/scalaz

Sliding

scala> (1 to 5).iterator.sliding(3).toList
res3: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))

List Operation

scala> List(1, 2, 3).zip(List("one", "two", "three"))
res8: List[(Int, String)] = List((1,one), (2,two), (3,three))

Run with Assembly Jar

./spark-submit --num-executors 2 --driver-memory 2G --class com.sillycat.jobs.GenerateFeatureMap ${path_to_jar}

Nice Configuration in build.sbt

// There is a problem with jackson 2.5+ with Spark 1.4.1
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)

When we build the assembly jar, Spark Core and the related libraries only need the "provided" scope:

"org.apache.spark" %% "spark-core" % "1.4.1" % "provided", // Apache v2
"org.apache.spark" %% "spark-mllib" % "1.4.1" % "provided", // Apache v2

2. Detail Operations

Generate Feature Map

step 1. Load job info from S3 (only title and description), cache()

step 2. Place the title and description in an object; use a regex to find the title and description again

step 3. Normalize the string

For title: toLower -> filter all HTML -> strip chars, only keep [a-zA-Z\d\-]

For description: toLower -> filter URL -> filter HTML -> strip chars -> strip numbers
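
The two normalization pipelines in step 3 can be sketched roughly as below. This is an illustration under assumptions, not the original job code: the object name `Normalizer`, the function names, and all of the regular expressions are hypothetical, and whitespace is kept (in addition to the characters listed above) so that words stay separated.

```scala
object Normalizer {
  // Hypothetical patterns; the original post does not show the real ones.
  private val HtmlTag    = "<[^>]*>".r
  private val Url        = """(https?://|www\.)[^\s"'<>]+""".r
  private val NotAllowed = """[^a-z\d\-\s]""".r   // input is already lowercased
  private val Digits     = """\d+""".r
  private val Spaces     = """\s+""".r

  // Collapse runs of whitespace and trim both ends.
  private def squeeze(s: String): String = Spaces.replaceAllIn(s, " ").trim

  // Title: toLower -> filter HTML -> strip disallowed characters.
  def normalizeTitle(raw: String): String = {
    val lower  = raw.toLowerCase
    val noHtml = HtmlTag.replaceAllIn(lower, " ")
    squeeze(NotAllowed.replaceAllIn(noHtml, " "))
  }

  // Description: toLower -> filter URL -> filter HTML -> strip chars -> strip numbers.
  def normalizeDescription(raw: String): String = {
    val lower   = raw.toLowerCase
    val noUrl   = Url.replaceAllIn(lower, " ")
    val noHtml  = HtmlTag.replaceAllIn(noUrl, " ")
    val cleaned = NotAllowed.replaceAllIn(noHtml, " ")
    squeeze(Digits.replaceAllIn(cleaned, " "))
  }
}
```

Replacing removed spans with a space rather than an empty string avoids gluing two neighboring words together when a tag or symbol sat between them.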

step 4. Tokenize the string

We predefined a list of 2-word and 3-word phrases and stored it in a text file.

For title:

Find the phrases in the string that are contained in the pre-defined list.

Convert the string to a list of words and phrases.

e.g. big data software engineer -> big, data, software, engineer, big data, software engineer

("big data" and "software engineer" are pre-defined in the list)

For description:

Find the phrases in the string that are contained in the pre-defined list.

Remove stop words, using a pre-defined stop word list.

Apply the Porter stemming algorithm (https://github.com/dlwh/epic, PorterStemmer.scala).

Convert the string to a list of words and phrases.
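
A minimal sketch of step 4, assuming the input has already been normalized and that phrases are matched as exact 2-word and 3-word windows. `Tokenizer` and its functions are hypothetical names. The stemmer is passed in as a plain function (defaulting to no stemming) so the sketch does not depend on the epic library; in the real job the Porter stemmer would be plugged in there.

```scala
object Tokenizer {
  // Split normalized text into single words.
  def words(text: String): List[String] =
    text.split("\\s+").filter(_.nonEmpty).toList

  // Every 2-word and 3-word window that appears in the pre-defined phrase list.
  def phrasesIn(ws: List[String], phrases: Set[String]): List[String] =
    for {
      n      <- List(2, 3)
      window <- ws.sliding(n).toList
      if window.size == n            // sliding yields a short window for short inputs
      phrase = window.mkString(" ")
      if phrases.contains(phrase)
    } yield phrase

  // Title: keep all words, then append the matched phrases.
  def tokenizeTitle(text: String, phrases: Set[String]): List[String] = {
    val ws = words(text)
    ws ++ phrasesIn(ws, phrases)
  }

  // Description: match phrases on the original words, then drop stop words
  // and stem the remaining single words. `stem` is a stand-in for a real stemmer.
  def tokenizeDescription(text: String,
                          phrases: Set[String],
                          stopWords: Set[String],
                          stem: String => String = (w: String) => w): List[String] = {
    val ws = words(text)
    ws.filterNot(stopWords.contains).map(stem) ++ phrasesIn(ws, phrases)
  }
}
```

With the phrase set `Set("big data", "software engineer")`, calling `Tokenizer.tokenizeTitle("big data software engineer", ...)` reproduces the example from the text.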

step 5. Calculate IDF

The document frequency DF(t, D) is the number of documents that contain term t.

|D| is the total number of documents in the corpus.

IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))

step 6. Save the file on S3

key, index, IDF
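
The IDF formula and the (key, index, IDF) records can be sketched on plain Scala collections as follows. This is not the Spark job itself. It assumes the natural logarithm, and it assigns the index by sorting terms alphabetically, which is a choice made here only to keep the output deterministic; `Idf.compute` is a hypothetical name.

```scala
object Idf {
  // docs: one token list per document.
  // Returns (key, index, IDF) records with IDF = log((|D| + 1) / (DF + 1)).
  def compute(docs: Seq[Seq[String]]): Seq[(String, Int, Double)] = {
    val totalDocs = docs.size

    // Document frequency: count each term at most once per document.
    val df: Map[String, Int] =
      docs.flatMap(_.distinct)
        .groupBy(identity)
        .map { case (term, occurrences) => (term, occurrences.size) }

    // Assumption: the index is the position of the term in sorted order.
    df.toSeq.sortBy(_._1).zipWithIndex.map { case ((term, count), index) =>
      (term, index, math.log((totalDocs + 1.0) / (count + 1.0)))
    }
  }
}
```

A term that occurs in every document gets an IDF of 0, so it contributes nothing when features are weighted by IDF.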

3. Classifier Model Training

step 1. Load the feature map that was pre-calculated in the previous operation

step 2. Binary feature extractor

step 3. Load the list of jobs

step 4. Train minor

step 5. Train arbitrator
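
The post does not describe the binary feature extractor in detail. One common reading is a vector with a 1 at the index of every known term that occurs in the document and 0 elsewhere. The sketch below follows that reading only as an assumption; `BinaryFeatures.extract` is a hypothetical name, and the feature map is assumed to hold indices from 0 to size - 1.

```scala
object BinaryFeatures {
  // featureIndex: term -> column index, as loaded from the saved feature map.
  // Terms that are not in the feature map are ignored.
  def extract(tokens: Seq[String], featureIndex: Map[String, Int]): Array[Double] = {
    val vector = Array.fill(featureIndex.size)(0.0)
    tokens.foreach { token =>
      featureIndex.get(token).foreach(i => vector(i) = 1.0)
    }
    vector
  }
}
```

Repeated tokens still yield 1.0, since only the presence of a term is recorded, not how often it occurs.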

4. Classification System

Major Group Classification System

Minor Group Classification System

References:

http://www.scalanlp.org/
