Lucene 4.3 Official Demo: Building an Index and Searching

singnojava

2013-06-14

-----------------------------------------------------------

IndexFiles

As we discussed in the previous walk-through, the IndexFiles class creates a Lucene Index. Let's take a look at how it does this.

The main() method parses the command-line parameters, then, in preparation for instantiating IndexWriter, opens a Directory and instantiates StandardAnalyzer and IndexWriterConfig.

The value of the -index command-line parameter is the name of the filesystem directory where all index information should be stored. If IndexFiles is invoked with a relative path given in the -index command-line parameter, or if the -index command-line parameter is not given, causing the default relative index path "index" to be used, the index path will be created as a subdirectory of the current working directory (if it does not already exist). On some platforms, the index path may be created in a different directory (such as the user's home directory).
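
To see where a relative -index value would end up, the resolution can be sketched with plain java.io.File. This is a minimal illustration, not part of the demo: the class name IndexPathDemo and the helper resolve() are hypothetical, and it assumes the JVM resolves relative paths against the "user.dir" system property.

```java
import java.io.File;

final class IndexPathDemo {
    // Resolve a possibly relative index path the way java.io.File does:
    // relative paths are interpreted against the "user.dir" system property.
    static File resolve(String indexPath) {
        return new File(indexPath).getAbsoluteFile();
    }

    public static void main(String[] args) {
        // "index" is the same default the demo uses when -index is not given.
        String indexPath = args.length > 0 ? args[0] : "index";
        System.out.println("Index would be written to: " + resolve(indexPath));
    }
}
```

Running it from different working directories prints different absolute paths, which is why the index location depends on where IndexFiles is launched.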

The -docs command-line parameter value is the location of the directory containing files to be indexed.

The -update command-line parameter tells IndexFiles not to delete the index if it already exists. When -update is not given, IndexFiles will first wipe the slate clean before indexing any documents.

Lucene Directory implementations are used by the IndexWriter to store information in the index. In addition to the FSDirectory implementation we are using, there are several other Directory subclasses that can write to RAM, to databases, etc.

Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms, and optionally perform other operations on these tokens, e.g. downcasing, synonym insertion, filtering out unwanted tokens, etc. The Analyzer we are using is StandardAnalyzer, which creates tokens using the Word Break rules from the Unicode Text Segmentation algorithm specified in Unicode Standard Annex #29; converts tokens to lowercase; and then filters out stopwords. Stopwords are common language words such as articles (a, an, the, etc.) and other tokens that may have less value for searching. Analysis rules differ from language to language, so you should use the proper analyzer for each. Lucene currently provides Analyzers for a number of different languages (see the javadocs under lucene/analysis/common/src/java/org/apache/lucene/analysis).
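
The three stages described above (tokenize, lowercase, drop stopwords) can be sketched in plain Java. This is a toy stand-in, not StandardAnalyzer: the class ToyAnalyzer is hypothetical, the regular-expression split is only a crude approximation of the UAX #29 word-break rules, and the stopword list is a small illustrative sample rather than Lucene's actual list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ToyAnalyzer {
    // Small illustrative stopword list (an assumption, not Lucene's list).
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "of"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // 1. Tokenize: split on runs of characters that are neither letters nor digits.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading delimiter produces one empty token
            }
            // 2. Lowercase the token.
            String term = token.toLowerCase(Locale.ROOT);
            // 3. Keep it only if it is not a stopword.
            if (!STOPWORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, index, files]
        System.out.println(analyze("The Quick brown fox, and an INDEX of files"));
    }
}
```

Because the same pipeline is applied to query text, a search for "INDEX" and a search for "index" would look up the same term.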

The IndexWriterConfig instance holds all configuration for IndexWriter. For example, we set the OpenMode to use here based on the value of the -update command-line parameter.

Looking further down in the file, after IndexWriter is instantiated, you should see the indexDocs() code. This recursive function crawls the directories and creates Document objects. The Document is simply a data object that represents the text content of the file as well as its last-modified time and location. These instances are added to the IndexWriter. If the -update command-line parameter is given, the IndexWriterConfig OpenMode will be set to OpenMode.CREATE_OR_APPEND, and rather than adding documents to the index, the IndexWriter will update them in the index by attempting to find an already-indexed document with the same identifier (in our case, the file path serves as the identifier); deleting it from the index if it exists; and then adding the new document to the index.
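
The difference between adding and updating can be modelled without Lucene. The sketch below is a hypothetical in-memory ToyIndex, not the IndexWriter implementation; it only mirrors the delete-then-add behaviour described above, with the file path as the identifier.

```java
import java.util.ArrayList;
import java.util.List;

final class ToyIndex {
    // A "document" here is just a path (the identifier) plus its contents.
    static final class Doc {
        final String path;
        final String contents;

        Doc(String path, String contents) {
            this.path = path;
            this.contents = contents;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    // Append unconditionally, as the demo does for a freshly created index.
    void addDocument(Doc doc) {
        docs.add(doc);
    }

    // Delete every document with the same path, then add the new one.
    void updateDocument(String path, Doc doc) {
        docs.removeIf(d -> d.path.equals(path));
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    // Contents of the first document with this path, or null if there is none.
    String contentsOf(String path) {
        for (Doc d : docs) {
            if (d.path.equals(path)) {
                return d.contents;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument(new Doc("docs/a.txt", "first version"));
        index.updateDocument("docs/a.txt", new Doc("docs/a.txt", "second version"));
        // prints: 1 document(s), contents: second version
        System.out.println(index.size() + " document(s), contents: " + index.contentsOf("docs/a.txt"));
    }
}
```

Calling addDocument twice for the same path would leave two copies in this model, which is the duplication that updating by identifier avoids when re-indexing into an existing index.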

Searching Files

The SearchFiles class is quite simple. It primarily collaborates with an IndexSearcher, a StandardAnalyzer (which is used in the IndexFiles class as well) and a QueryParser. The query parser is constructed with an analyzer used to interpret your query text in the same way the documents are interpreted: finding word boundaries, downcasing, and removing useless words like 'a', 'an' and 'the'. The QueryParser produces a Query object, which is then passed to the searcher. Note that it's also possible to programmatically construct a rich Query object without using the query parser. The query parser just enables decoding the Lucene query syntax into the corresponding Query object.

SearchFiles uses the IndexSearcher.search(query, n) method, which returns TopDocs containing at most n hits. The results are printed in pages, sorted by score (i.e. relevance).
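
The paging over the collected hits comes down to a little index arithmetic. The sketch below isolates it; the class Paging and its method names are hypothetical, and the clamping with Math.min follows what the doPagingSearch() listing does.

```java
final class Paging {
    // Index of the first hit on a 1-based page number.
    static int pageStart(int page, int hitsPerPage) {
        return (page - 1) * hitsPerPage;
    }

    // Exclusive end index of the page, clamped to the number of hits available.
    static int pageEnd(int start, int hitsPerPage, int available) {
        return Math.min(available, start + hitsPerPage);
    }

    public static void main(String[] args) {
        int hitsPerPage = 10;
        int available = 23; // pretend the search collected 23 hits
        for (int page = 1; pageStart(page, hitsPerPage) < available; page++) {
            int start = pageStart(page, hitsPerPage);
            int end = pageEnd(start, hitsPerPage, available);
            // prints: page 1: hits 1-10, page 2: hits 11-20, page 3: hits 21-23
            System.out.println("page " + page + ": hits " + (start + 1) + "-" + end);
        }
    }
}
```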

------------------------------------------------------------

Building the index:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

public class IndexFiles {

private IndexFiles() {
}

/**
* @param args
*/
public static void main(String[] args) {

  // -index command-line parameter is the name of the filesystem directory
  // where all index information should be stored
  // -docs command-line parameter value is the location of the directory
  // containing files to be indexed
  // -update command-line parameter tells IndexFiles not to delete the
  // index if it already exists. When -update is not given, IndexFiles
  // will first wipe the slate clean before indexing any documents
  String usage = "java org.apache.lucene.demo.IndexFiles"
    + " [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"
    + "This indexes the documents in DOCS_PATH, creating a Lucene index "
    + "in INDEX_PATH that can be searched with SearchFiles";
  String indexPath = "index";
  String docsPath = null;
  boolean create = true;
  for (int i = 0; i < args.length; i++) {
   if ("-index".equals(args[i])) {
    indexPath = args[i + 1];
    i++;
   } else if ("-docs".equals(args[i])) {
    docsPath = args[i + 1];
    i++;
   } else if ("-update".equals(args[i])) {
    create = false;
   }
  }
  if (docsPath == null) {
   System.err.println("Usage: " + usage);
   System.exit(1);
  }

  final File docDir = new File(docsPath);
  if (!docDir.exists() || !docDir.canRead()) {
   System.out
     .println("Document directory '"
       + docDir.getAbsolutePath()
       + "' does not exist or is not readable, please check the path");
   System.exit(1);
  }
  Date start = new Date();

  try {
   System.out.println("Indexing to directory '" + indexPath + "'...");

   Directory dir = FSDirectory.open(new File(indexPath));
   Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
   IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
     analyzer);

   if (create) {
    // Create a new index in the directory, removing any previously indexed documents:
    iwc.setOpenMode(OpenMode.CREATE);
   } else {
    // Add new documents to an existing index:
    iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
   }

   // Optional: for better indexing performance, if you
   // are indexing many documents, increase the RAM
   // buffer. But if you do this, increase the max heap
   // size to the JVM (eg add -Xmx512m or -Xmx1g):
   //
   // iwc.setRAMBufferSizeMB(256.0);

   IndexWriter writer = new IndexWriter(dir, iwc);
   indexDocs(writer, docDir);

   // NOTE: if you want to maximize search performance,
   // you can optionally call forceMerge here. This can be
   // a terribly costly operation, so generally it's only
   // worth it when your index is relatively static (ie
   // you're done adding documents to it):
   //
   // writer.forceMerge(1);

   writer.close();

   Date end = new Date();
   System.out.println(end.getTime() - start.getTime()
     + " total milliseconds");

  } catch (IOException e) {
   System.out.println(" caught a " + e.getClass()
     + "\n with message: " + e.getMessage());
  }

}

static void indexDocs(IndexWriter writer, File file)
     throws IOException {
     // do not try to index files that cannot be read
     if (file.canRead()) {
       if (file.isDirectory()) {
         String[] files = file.list();
         // an IO error could occur
         if (files != null) {
           for (int i = 0; i < files.length; i++) {
             indexDocs(writer, new File(file, files[i]));
           }
         }
       } else {

         FileInputStream fis;
         try {
           fis = new FileInputStream(file);
         } catch (FileNotFoundException fnfe) {
           // at least on windows, some temporary files raise this exception with an "access denied" message
           // checking if the file can be read doesn't help
           return;
         }

         try {

           // make a new, empty document
           Document doc = new Document();

           // Add the path of the file as a field named "path". Use a
           // field that is indexed (i.e. searchable), but don't tokenize
           // the field into separate words and don't index term frequency
           // or positional information:
           Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
           doc.add(pathField);

           // Add the last modified date of the file as a field named "modified".
           // Use a LongField that is indexed (i.e. efficiently filterable with
           // NumericRangeFilter). This indexes to milli-second resolution, which
           // is often too fine. You could instead create a number based on
           // year/month/day/hour/minutes/seconds, down to the resolution you require.
           // For example the long value 2011021714 would mean
           // February 17, 2011, 2-3 PM.
           doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));

           // Add the contents of the file to a field named "contents". Specify a Reader,
           // so that the text of the file is tokenized and indexed, but not stored.
           // Note that the reader below assumes the file is in UTF-8 encoding.
           // If that's not the case searching for special characters will fail.
           doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

           if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
             // New index, so we just add the document (no old document can be there):
             System.out.println("adding " + file);
             writer.addDocument(doc);
           } else {
             // Existing index (an old copy of this document may have been indexed) so
             // we use updateDocument instead to replace the old one matching the exact
             // path, if present:
             System.out.println("updating " + file);
             writer.updateDocument(new Term("path", file.getPath()), doc);
           }

         } finally {
           fis.close();
         }
       }
     }

}
}
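
The comment in indexDocs() suggests indexing a coarser date value such as 2011021714 instead of raw milliseconds. A minimal sketch of that conversion, using java.time (Java 8 or later, which is newer than the JDKs Lucene 4.3 targeted), might look like the following. The class HourResolution and its method are hypothetical, and the time zone is passed in explicitly as an assumption about how the value should be interpreted.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class HourResolution {
    // Year, month, day and hour packed into digits, e.g. 2011021714.
    private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("yyyyMMddHH");

    // Convert epoch milliseconds (what File.lastModified() returns) to hour resolution.
    static long toHourResolution(long epochMillis, ZoneId zone) {
        return Long.parseLong(HOUR.format(Instant.ofEpochMilli(epochMillis).atZone(zone)));
    }

    public static void main(String[] args) {
        long millis = Instant.parse("2011-02-17T14:05:00Z").toEpochMilli();
        // prints 2011021714
        System.out.println(toHourResolution(millis, ZoneOffset.UTC));
    }
}
```

The resulting long could be passed to the "modified" field in place of file.lastModified(); all files modified within the same hour then share one value.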

Searching the index:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SearchFiles {

private SearchFiles() {
}

/**
* @param args
*/
public static void main(String[] args) throws Exception {

  String usage = "Usage:\tjava org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-query string] [-raw] [-paging hitsPerPage]\n\nSee http://lucene.apache.org/core/4_1_0/demo/ for details.";
  if (args.length > 0
    && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
   System.out.println(usage);
   System.exit(0);
  }

  String index = "index";
  String field = "contents";
  String queries = null;
  int repeat = 0;
  boolean raw = false;
  String queryString = null;
  int hitsPerPage = 10;

  for (int i = 0; i < args.length; i++) {
   if ("-index".equals(args[i])) {
    index = args[i + 1];
    i++;
   } else if ("-field".equals(args[i])) {
    field = args[i + 1];
    i++;
   } else if ("-queries".equals(args[i])) {
    queries = args[i + 1];
    i++;
   } else if ("-query".equals(args[i])) {
    queryString = args[i + 1];
    i++;
   } else if ("-repeat".equals(args[i])) {
    repeat = Integer.parseInt(args[i + 1]);
    i++;
   } else if ("-raw".equals(args[i])) {
    raw = true;
   } else if ("-paging".equals(args[i])) {
    hitsPerPage = Integer.parseInt(args[i + 1]);
    if (hitsPerPage <= 0) {
     System.err
       .println("There must be at least 1 hit per page.");
     System.exit(1);
    }
    i++;
   }
  }

  IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
  IndexSearcher searcher = new IndexSearcher(reader);
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
  BufferedReader in = null;
  if (queries != null) {
   in = new BufferedReader(new InputStreamReader(new FileInputStream(queries), "UTF-8"));
  } else {
   in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
  }
  QueryParser parser = new QueryParser(Version.LUCENE_40, field, analyzer);
  while (true) {
   if (queries == null && queryString == null) { // prompt the user
    System.out.println("Enter query: ");
   }

   String line = queryString != null ? queryString : in.readLine();

   if (line == null) { // end of input
    break;
   }

   line = line.trim();
   if (line.length() == 0) {
    break;
   }

   Query query = parser.parse(line);
   System.out.println("Searching for: " + query.toString(field));
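   // If repeat is greater than 0, the top 100 hits are fetched repeatedly; the results
   // themselves are not used (this is simply how the demo is written)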
   if (repeat > 0) { // repeat & time as benchmark
    Date start = new Date();
    for (int i = 0; i < repeat; i++) {
     searcher.search(query, null, 100);
    }
    Date end = new Date();
    System.out.println("Time: " + (end.getTime() - start.getTime())+ "ms");
   }

   doPagingSearch(in, searcher, query, hitsPerPage, raw,queries == null && queryString == null);
   if (queryString != null) {
    break;
   }
  }
  reader.close();

}

public static void doPagingSearch(BufferedReader in,IndexSearcher searcher, Query query, int hitsPerPage, boolean raw,
boolean interactive) throws IOException {

  // Collect enough docs to show 5 pages
  TopDocs results = searcher.search(query, 5 * hitsPerPage);
  ScoreDoc[] hits = results.scoreDocs;

  int numTotalHits = results.totalHits;
  System.out.println(numTotalHits + " total matching documents");

  int start = 0;
  int end = Math.min(numTotalHits, hitsPerPage);

  while (true) {
   if (end > hits.length) {
    System.out.println("Only results 1 - " + hits.length + " of "+ numTotalHits+ " total matching documents collected.");
    System.out.println("Collect more (y/n) ?");
    String line = in.readLine();
    // readLine() returns null at end of input, so check before using the line
    if (line == null || line.length() == 0 || line.charAt(0) == 'n') {
     break;
    }

    hits = searcher.search(query, numTotalHits).scoreDocs;
   }

   end = Math.min(hits.length, start + hitsPerPage);

   for (int i = start; i < end; i++) {
    if (raw) { // output raw format
     System.out.println("doc=" + hits[i].doc + " score="
       + hits[i].score);
     continue;
    }

    Document doc = searcher.doc(hits[i].doc);
    String path = doc.get("path");
    if (path != null) {
     System.out.println((i + 1) + ". " + path);
     String title = doc.get("title");
     if (title != null) {
      System.out.println("   Title: " + doc.get("title"));
     }
    } else {
     System.out.println((i + 1) + ". "
       + "No path for this document");
    }

   }

   if (!interactive || end == 0) {
    break;
   }

   if (numTotalHits >= end) {
    boolean quit = false;
    while (true) {
     System.out.print("Press ");
     if (start - hitsPerPage >= 0) {
      System.out.print("(p)revious page, ");
     }
     if (start + hitsPerPage < numTotalHits) {
      System.out.print("(n)ext page, ");
     }
     System.out
       .println("(q)uit or enter number to jump to a page.");

     String line = in.readLine();
     // readLine() returns null at end of input, so treat that as quit
     if (line == null || line.length() == 0 || line.charAt(0) == 'q') {
      quit = true;
      break;
     }
     if (line.charAt(0) == 'p') {
      start = Math.max(0, start - hitsPerPage);
      break;
     } else if (line.charAt(0) == 'n') {
      if (start + hitsPerPage < numTotalHits) {
       start += hitsPerPage;
      }
      break;
     } else {
      int page = Integer.parseInt(line);
      if ((page - 1) * hitsPerPage < numTotalHits) {
       start = (page - 1) * hitsPerPage;
       break;
      } else {
       System.out.println("No such page");
      }
     }
    }
    if (quit)
     break;
    end = Math.min(numTotalHits, start + hitsPerPage);
   }
  }
}

}
