Text Extract(1) Tika Basic

1. Introduction

Tika supports a lot of different file formats, including audio, video, pictures and text files.

The Tika bundle includes tika-app, a single jar that provides both a GUI and a command-line tool.

Command-line interface + GUI

Language identifier + Tika Facade + MIME Type

Parser
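The MIME type piece of that stack can be tried on its own through the Tika facade. Below is a minimal sketch, assuming tika-core 1.10 (or tika-app-1.10.jar) is on the classpath; the class name DetectTypeSketch and its two helper methods are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

public class DetectTypeSketch {

    // Hypothetical helper: guess the media type from the file name only.
    static String byName(String fileName) {
        return new Tika().detect(fileName);
    }

    // Hypothetical helper: guess the media type from the leading bytes of the content.
    static String byContent(byte[] prefix) {
        return new Tika().detect(prefix);
    }

    public static void main(String[] args) {
        // Name-based detection relies on the file extension.
        System.out.println(byName("3-resume.pdf"));

        // Content-based detection relies on magic bytes; a PDF starts with "%PDF-".
        byte[] pdfHeader = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(byContent(pdfHeader));
    }
}
```

Both calls should report application/pdf. Neither call parses the document, so this is a cheap way to decide whether a file is worth sending to a parser at all.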

There are 3 files:

http://mirrors.sonic.net/apache/tika/tika-server-1.10.jar

http://apache.mirrors.hoobly.com/tika/tika-app-1.10.jar

http://ftp.wayne.edu/apache/tika/tika-1.10-src.zip

The source code is managed by Maven, so I can build it directly.

> mvn clean install -DskipTests=true

Either running the command below or double-clicking the tika-app jar will start it.

> java -jar tika-app-1.10.jar --gui

Then we can choose files and switch the view to see the different kinds of content Tika extracts from them.

2. Try the Packages in Java Code

The simplest Java code to fetch the content of a file:

package com.sillycat.resumeparse;

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-resume.pdf";

    public static void main(String[] args) {
        // Create a Tika instance with the default configuration
        Tika tika = new Tika();

        // Parse the given file and print out the extracted text content
        String text = null;
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        System.out.print(text);
    }
}
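The same facade call also accepts an InputStream, which is handy when the document is not a file on disk. The sketch below is an illustration under assumptions: it expects tika-app-1.10.jar (or tika-parsers 1.10) on the classpath, and the class name ParseStreamSketch and the helper extract are invented here. It also shows setMaxStringLength, since parseToString cuts the extracted text at 100,000 characters by default.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ParseStreamSketch {

    // Hypothetical helper: extract text from in-memory bytes.
    // maxChars caps the length of the returned text; -1 means no limit.
    static String extract(byte[] data, int maxChars) {
        Tika tika = new Tika();
        tika.setMaxStringLength(maxChars);
        try (InputStream in = new ByteArrayInputStream(data)) {
            return tika.parseToString(in);
        } catch (IOException | TikaException e) {
            // Wrapped so that callers in this sketch need no checked-exception handling.
            throw new IllegalStateException("Text extraction failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = "Hello Tika, this is plain text."
                .getBytes(StandardCharsets.UTF_8);

        // Full text, no length limit.
        System.out.println(extract(sample, -1));

        // Only the first characters, up to the given limit.
        System.out.println(extract(sample, 10));
    }
}
```

For large resumes or scanned PDFs, raising or disabling the limit avoids silently truncated output, at the cost of holding the whole text in memory.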

Fetch the Metadata and Identify Language

package com.sillycat.resumeparse;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-duffy.pdf";

    public static void main(String[] args) {
        Tika tika = new Tika();
        String text = null;
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        Metadata metadata = new Metadata();

        // fetch the content
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.print(text);

        // fetch the metadata; the handler collects the body text in the same pass
        // (try-with-resources closes the stream once parsing is done)
        try (InputStream stream = new FileInputStream(file)) {
            parser.parse(stream, handler, metadata, context);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.println(handler.toString());
        String[] metadataNames = metadata.names();
        for (String name : metadataNames) {
            // System.out.println(name + ": " + metadata.get(name));
        }

        // identify language from the body text collected above;
        // parsing again into the same handler would append the text a second time
        LanguageIdentifier object = new LanguageIdentifier(handler.toString());
        System.out.println("Language name: " + object.getLanguage());
    }
}
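To try the metadata and language steps without a PDF at hand, the same calls can be run against in-memory text. This is a rough sketch, assuming tika-app-1.10.jar is on the classpath; the class MetadataLanguageSketch, its Result holder, the SAMPLE text and the inspect helper are all hypothetical names introduced for this example. It also checks isReasonablyCertain(), because the identifier always returns some language code even for very short or mixed input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class MetadataLanguageSketch {

    // Hypothetical sample input, long enough for n-gram based language detection.
    static final String SAMPLE =
            "This is a short resume written in English. The candidate has worked "
            + "as a software engineer for several years and has experience with "
            + "search engines, text extraction and building web applications.";

    // Hypothetical holder for what one parse pass produced.
    static final class Result {
        final String contentType;
        final String language;
        final boolean certain;

        Result(String contentType, String language, boolean certain) {
            this.contentType = contentType;
            this.language = language;
            this.certain = certain;
        }
    }

    // Parse once, then read the metadata and run language detection on the body text.
    static Result inspect(byte[] data) {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new ByteArrayInputStream(data)) {
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        } catch (IOException | SAXException | TikaException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        LanguageIdentifier identifier = new LanguageIdentifier(handler.toString());
        return new Result(metadata.get(Metadata.CONTENT_TYPE),
                identifier.getLanguage(), identifier.isReasonablyCertain());
    }

    public static void main(String[] args) {
        Result result = inspect(SAMPLE.getBytes(StandardCharsets.UTF_8));
        System.out.println("Content-Type: " + result.contentType);
        System.out.println("Language: " + result.language
                + " (reasonably certain: " + result.certain + ")");
    }
}
```

getLanguage() returns an ISO 639 code such as en, so the result can be used directly as a key when routing documents to language-specific processing.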

References:

https://tika.apache.org/

https://github.com/luohuazju/sillycat-resume-parse

http://itindex.net/detail/41933-apache-tika-%E9%80%9A%E7%94%A8

Books

Tika in Action.pdf

http://m.yiibai.com/tika/tika_content_extraction.html