Classification(3) Generate Features and Stem Adjust the Model System
1. Scala Operation
String Method - contains
scala> val longContent = "carl love to study python, scala"
longContent: String = carl love to study python, scala
scala> longContent.contains("python")
res0: Boolean = true
Map Merge Function
Start the console from the project directory, which already has the jar dependencies.
> sbt console
scala> import scalaz.Scalaz._
import scalaz.Scalaz._
scala> val m1 = Map("0" -> 0, "1" -> 1)
m1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1)
scala> val m2 = Map("2" -> 2)
m2: scala.collection.immutable.Map[String,Int] = Map(2 -> 2)
scala> val m3 = m1 |+| m2
m3: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
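For Map[String, Int], the scalaz |+| operator adds the values of keys that appear in both maps, where the standard ++ would let the right-hand map overwrite the left. The keys in the example above do not overlap, so the difference is not visible there. A minimal plain-Scala sketch of that behaviour, where mergeSum is a hypothetical helper and not part of scalaz:

```scala
// Plain-Scala sketch of what |+| does for Map[String, Int].
// mergeSum is a hypothetical name used only for illustration.
def mergeSum(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
  b.foldLeft(a) { case (acc, (k, v)) =>
    // Add to the existing value when the key is already present.
    acc.updated(k, acc.getOrElse(k, 0) + v)
  }

// The key "1" appears in both maps, so its values are added: 1 + 10.
val merged = mergeSum(Map("0" -> 0, "1" -> 1), Map("1" -> 10, "2" -> 2))
```

This is handy for combining per-partition word counts, where the same key is expected on both sides.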
Map Operation
scala> m3
res1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
scala> m3 - "0"
res2: scala.collection.immutable.Map[String,Int] = Map(1 -> 1, 2 -> 2)
Magic scalaz
https://github.com/scalaz/scalaz
Sliding
scala> (1 to 5).iterator.sliding(3).toList
res3: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))
List Operation
scala> List(1, 2, 3).zip(List("one", "two", "three"))
res8: List[(Int, String)] = List((1,one), (2,two), (3,three))
Run with Assembly Jar
./spark-submit --num-executors 2 --driver-memory 2G --class com.sillycat.jobs.GenerateFeatureMap ${path_to_jar}
Nice Configuration in build.sbt
// There's a problem with jackson 2.5+ with Spark 1.4.1
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)
When we build the assembly jar, Spark core and the related Spark libraries can be marked as provided, so they are not bundled into the jar.
"org.apache.spark" %% "spark-core" % "1.4.1" % "provided", // Apache v2
"org.apache.spark" %% "spark-mllib" % "1.4.1" % "provided", // Apache v2
2. Detail Operations
Generate Feature Map
step1. Load the job info from S3 (only title and description), cache()
step2. Place the title and description in an object; use a regex to find the title and description again
step3. Normalize the string
For title: toLower -> filter all HTML -> strip chars, only keep [a-zA-Z\d\-]
For description: toLower -> filter URL -> filter HTML -> strip chars -> strip numbers
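The two normalization chains in step3 can be sketched with plain regular expressions. The post does not show the actual patterns, so every pattern and function name below is an assumption made for illustration; replacing stripped characters with a space (rather than deleting them) is also a choice of this sketch.

```scala
// Hypothetical patterns; the real job may use different ones.
val HtmlTag   = "<[^>]*>".r
val Url       = """https?://\S+""".r
val NotKept   = """[^a-zA-Z\d\-]""".r   // everything outside [a-zA-Z\d\-]
val Digits    = """\d+""".r

// Collapse runs of whitespace left behind by the replacements.
def collapse(s: String): String = s.trim.replaceAll("\\s+", " ")

// Title: toLower -> filter HTML -> strip chars.
def normalizeTitle(raw: String): String = {
  val lower  = raw.toLowerCase
  val noHtml = HtmlTag.replaceAllIn(lower, " ")
  collapse(NotKept.replaceAllIn(noHtml, " "))
}

// Description: toLower -> filter URL -> filter HTML -> strip chars -> strip numbers.
def normalizeDescription(raw: String): String = {
  val lower    = raw.toLowerCase
  val noUrl    = Url.replaceAllIn(lower, " ")
  val noHtml   = HtmlTag.replaceAllIn(noUrl, " ")
  val stripped = NotKept.replaceAllIn(noHtml, " ")
  collapse(Digits.replaceAllIn(stripped, " "))
}
```

URLs are removed before HTML so that the tag filter does not split a URL into fragments that survive the later steps.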
step4. Tokenize the string
We predefined a list of 2-word and 3-word phrases and stored it in a text file.
For title:
Find the phrases in the string which are contained in the predefined list.
Convert the string to a list of words and phrases.
eg: big data software engineer -> big, data, software, engineer, big data, software engineer
(big data and software engineer are predefined in the list)
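The title tokenization can be sketched with the sliding operation shown earlier: build every 2-word and 3-word window, keep the ones found in the phrase list, and append them to the single words. The function name and the in-memory Set are assumptions; the post only says the phrases live in a text file.

```scala
// Sketch of title tokenization; tokenizeTitle is a hypothetical name.
def tokenizeTitle(normalized: String, phrases: Set[String]): List[String] = {
  val words = normalized.split("\\s+").filter(_.nonEmpty).toList
  // Candidate phrases from 2-word and 3-word sliding windows.
  // sliding(n) yields one short window when the list has fewer than n
  // elements, so windows of the wrong size are dropped.
  val candidates = for {
    n      <- List(2, 3)
    window <- words.sliding(n).toList if window.size == n
  } yield window.mkString(" ")
  words ++ candidates.filter(phrases.contains)
}

val tokens = tokenizeTitle("big data software engineer",
                           Set("big data", "software engineer"))
```

With the phrase list from the example above, tokens holds the four single words followed by the two matched phrases.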
For description:
Find the phrases in the string which are contained in the predefined list.
Remove stop words, using a predefined stop word list.
Apply the Porter stemming algorithm (https://github.com/dlwh/epic, PorterStemmer.scala).
Convert the string to a list of words and phrases.
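A rough sketch of the description pipeline follows. The stemmer is passed in as a function so the sketch stays self-contained; in the real job that parameter would be backed by the Porter stemmer from epic, whose exact call is not shown in this post. The function name, and the choice to leave matched phrases unstemmed, are assumptions.

```scala
// Sketch of description tokenization; all names are hypothetical.
def tokenizeDescription(
    normalized: String,
    phrases: Set[String],
    stopWords: Set[String],
    stem: String => String        // stand-in for a Porter stemmer
): List[String] = {
  val words = normalized.split("\\s+").filter(_.nonEmpty).toList
  // 1. Find predefined 2-word and 3-word phrases before any word is removed.
  val found = List(2, 3).flatMap { n =>
    words.sliding(n).toList.filter(_.size == n).map(_.mkString(" "))
  }.filter(phrases.contains)
  // 2. Remove stop words, 3. stem what is left.
  val stemmed = words.filterNot(stopWords.contains).map(stem)
  // 4. Words first, then phrases.
  stemmed ++ found
}

// Toy stemmer for demonstration only; it is NOT the Porter algorithm.
val toyStem: String => String =
  w => if (w.length > 3 && w.endsWith("s")) w.dropRight(1) else w

val descTokens = tokenizeDescription(
  "we build big data pipelines for the team",
  Set("big data"), Set("we", "for", "the"), toyStem)
```

Phrases are matched before stop word removal so that a phrase containing a stop word can still be found.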
step5. Calculate IDF
The document frequency DF(t, D) is the number of documents that contain term t.
|D| is the total number of documents in the corpus.
IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
step6. Save the file on S3
key, index, IDF
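The formula can be checked on a tiny in-memory corpus. This sketch uses plain Scala collections rather than an RDD, and assumes the natural logarithm; the function name is made up for the example.

```scala
// Sketch of IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)) on local collections.
def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
  val n = docs.size                               // |D|
  // DF(t, D): count each term at most once per document.
  val df = docs.flatMap(_.distinct)
               .groupBy(identity)
               .map { case (term, occurrences) => (term, occurrences.size) }
  df.map { case (term, d) => (term, math.log((n + 1.0) / (d + 1.0))) }
}

val weights = idf(Seq(
  Seq("scala", "spark"),
  Seq("scala", "java"),
  Seq("scala")
))
```

Here "scala" appears in all three documents, so its weight is log(4 / 4) = 0, while "spark" and "java" each appear once and get log(4 / 2) = log 2.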
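One possible way to produce the key, index, IDF records is to sort the terms and number them with zipWithIndex. The comma separator, the alphabetical ordering and the function name are assumptions; the post does not specify the file layout beyond the three fields.

```scala
// Sketch: turn an IDF map into "key,index,IDF" lines.
// Sorting by key keeps the index assignment deterministic between runs.
def featureLines(idf: Map[String, Double]): Seq[String] =
  idf.toSeq.sortBy(_._1).zipWithIndex.map { case ((term, weight), index) =>
    s"$term,$index,$weight"
  }

val lines = featureLines(Map("spark" -> 0.5, "java" -> 1.0))
```

The index assigned here is what later lets a classifier map a token to a fixed position in a feature vector.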
3. Classifier Model Training
step1. Load the featureMap which was pre-calculated in the previous operation
step2. Binary Feature Extractor
step3. Load the list of jobs
step4. Train Minor
step5. Train Arbitrator
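The post does not show the binary feature extractor from step2, so the following is only a guess at its shape: each token that exists in the featureMap switches on the value 1.0 at that token's index, and unknown tokens are ignored. The function name and return type are assumptions.

```scala
// Sketch of a binary feature extractor; binaryFeatures is a hypothetical name.
// Returns the sorted indices of the features that are present, plus a
// matching array of 1.0 values.
def binaryFeatures(tokens: Seq[String],
                   featureMap: Map[String, Int]): (Array[Int], Array[Double]) = {
  val indices = tokens.flatMap(t => featureMap.get(t)).distinct.sorted.toArray
  (indices, Array.fill(indices.length)(1.0))
}

val (indices, values) = binaryFeatures(
  Seq("scala", "spark", "scala", "cobol"),
  Map("java" -> 0, "scala" -> 1, "spark" -> 2))
```

The pair of arrays matches the arguments of MLlib's Vectors.sparse(size, indices, values), which is one way such features could be handed to a Spark classifier.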
4. Classification System
Major Group Classification System
Minor Group Classification System
References:
http://www.scalanlp.org/