Text Extract (1) Tika Basic
1. Introduction
Tika supports a lot of different file formats, including audio, video, pictures and text files.
The Tika bundle includes tika-app, a single jar that works as both a GUI and a command-line tool.
Command-line interface + GUI
Language identifier + Tika Facade + MIME Type
Parser
There are 3 files:
http://mirrors.sonic.net/apache/tika/tika-server-1.10.jar
http://apache.mirrors.hoobly.com/tika/tika-app-1.10.jar
http://ftp.wayne.edu/apache/tika/tika-1.10-src.zip
The source code is managed by Maven, so I can build it directly.
> mvn clean install -DskipTests=true
Running the command below or double-clicking the tika-app jar both work.
> java -jar tika-app-1.10.jar --gui
Then we can choose files and change the view to see the different contents extracted from them.
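Besides the GUI, the same jar can be driven from the command line. The following is a minimal sketch of calling it from Java, assuming tika-app's `--text` option (plain-text output) and a `java` executable on the PATH. The class name `TikaCliSketch`, its helper methods and the paths are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

public class TikaCliSketch {

    // Builds the command line that runs tika-app in plain-text mode.
    public static List<String> textCommand(String jarPath, String filePath) {
        return Arrays.asList("java", "-jar", jarPath, "--text", filePath);
    }

    // Starts the process and returns everything it wrote to standard output.
    public static String run(List<String> command)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show errors on our stderr
                .start();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] chunk = new byte[8192];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        }
        process.waitFor();
        // Assumption: the child process writes in the platform default charset.
        return new String(buffer.toByteArray(), Charset.defaultCharset());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical locations: the jar in the working directory, a sample PDF.
        List<String> command =
                textCommand("tika-app-1.10.jar", "/opt/data/resume/3-resume.pdf");
        System.out.print(run(command));
    }
}
```

Keeping the command construction in its own method makes it easy to swap `--text` for another output mode without touching the process handling.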
2. Try the Packages in Java Code
The simplest Java code to fetch the content of files.
package com.sillycat.resumeparse;

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-resume.pdf";

    public static void main(String[] args) {
        // Create a Tika instance with the default configuration
        Tika tika = new Tika();
        // Parse all given files and print out the extracted text content
        String text = null;
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        System.out.print(text);
    }
}
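The Tika facade used above can also report a file's MIME type, which is handy for checking what kind of document is about to be parsed. This is a small sketch, assuming the Tika 1.x jars are on the classpath. The class name `DetectTypeSketch` and the method `typeFromName` are hypothetical names for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;

public class DetectTypeSketch {

    // Guesses the media type from the file name alone; the file is not opened.
    public static String typeFromName(String fileName) {
        return new Tika().detect(fileName);
    }

    public static void main(String[] args) throws IOException {
        Tika tika = new Tika();
        // Same sample path as in the post; adjust to a file that exists locally.
        File file = new File("/opt/data/resume/3-resume.pdf");
        // Detection on a File can also inspect the content, not just the name.
        System.out.println(tika.detect(file));
        System.out.println(typeFromName("3-resume.pdf"));
    }
}
```

Name-based detection is cheap but can be fooled by a wrong extension, so detecting on the file itself is the safer choice when the file is available.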
Fetch the Metadata and Identify the Language
package com.sillycat.resumeparse;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-duffy.pdf";

    public static void main(String[] args) {
        Tika tika = new Tika();
        String text = null;
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        Metadata metadata = new Metadata();

        // fetch the content
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.print(text);

        // fetch the meta; try-with-resources closes the stream after parsing
        try (InputStream stream = new FileInputStream(file)) {
            parser.parse(stream, handler, metadata, context);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.println(handler.toString());
        String[] metadataNames = metadata.names();
        for (String name : metadataNames) {
            // System.out.println(name + ": " + metadata.get(name));
        }

        // identify language; the handler already holds the body text from the
        // parse above, so parsing the file again would only append it twice
        LanguageIdentifier object = new LanguageIdentifier(handler.toString());
        System.out.println("Language name: " + object.getLanguage());
    }
}
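Language identification on short or mixed text can be unreliable, so it may be worth checking how confident Tika is before trusting the result. This is a minimal sketch, assuming the Tika 1.x `LanguageIdentifier` class used above and its `isReasonablyCertain()` method. The class name `LanguageCheckSketch`, the method `languageOrUnknown` and the sample sentence are hypothetical.

```java
import org.apache.tika.language.LanguageIdentifier;

public class LanguageCheckSketch {

    // Returns the detected language code, or "unknown" when the
    // identifier does not consider its own guess reasonably certain.
    public static String languageOrUnknown(String text) {
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        return identifier.isReasonablyCertain()
                ? identifier.getLanguage()
                : "unknown";
    }

    public static void main(String[] args) {
        // Illustrative sample text; in the program above this would be
        // handler.toString() after parsing.
        String sample = "Apache Tika extracts text and metadata from many "
                + "different kinds of documents using a single interface.";
        System.out.println("Language name: " + languageOrUnknown(sample));
    }
}
```

Longer input generally gives the identifier more to work with, so passing the whole extracted body text rather than a single line is usually the better choice.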
References:
https://tika.apache.org/
https://github.com/luohuazju/sillycat-resume-parse
http://itindex.net/detail/41933-apache-tika-%E9%80%9A%E7%94%A8
Books:
Tika in Action.pdf
http://m.yiibai.com/tika/tika_content_extraction.html