Questioning how Lucene in Action and other books explain mergeFactor
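I am about to start working on search, and my company uses Lucene, so I have been studying it on my own first. I read Lucene in Action and a Lucene 2.0 + Heritrix book I bought today. Both describe mergeFactor the same way: "each time mergeFactor documents are added to the index, a new segment is created on disk ...". Both mention minMergeDocs only in passing, saying that it limits the number of documents held in memory.

That puzzled me. Described this way, the two settings would conflict, because they would do the same job. So I ran a few experiments. I have 81 documents, and I set mergeFactor to 5, minMergeDocs to 8, and maxMergeDocs to 45. According to the books, a segment should then be created for every 5 documents. What actually happens?
[code]
package org.apache.lucene.demo;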
/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Date;

class IndexFiles {
  public static void main(String[] args) throws IOException {
    String usage = "java " + IndexFiles.class + " <root_directory>";
    if (args.length == 0) {
      System.err.println("Usage: " + usage);
      System.exit(1);
    }

    Date start = new Date();
    try {
      // args[0] is the directory holding the documents to index.
      // The index itself is written to ./index; create=true overwrites an old one.
      File docDir = new File(args[0]);
      IndexWriter writer = new IndexWriter("index",
                                           new StandardAnalyzer(), true);
      // Keep one set of files per segment so segments are easy to spot on disk.
      writer.setUseCompoundFile(false);
      writer.mergeFactor = 5;
      writer.maxMergeDocs = 40;
      writer.minMergeDocs = 8;

      indexDocs(writer, docDir);

      // writer.optimize();  // left disabled so the segments are not merged into one
      writer.close();

      Date end = new Date();
      System.out.print(end.getTime() - start.getTime());
      System.out.println(" total milliseconds");
    } catch (IOException e) {
      System.out.println(" caught a " + e.getClass()
          + "\n with message: " + e.getMessage());
    }
  }

  public static void indexDocs(IndexWriter writer, File file)
      throws IOException {
    // do not try to index files that cannot be read
    if (file.canRead()) {
      if (file.isDirectory()) {
        String[] files = file.list();
        // an IO error could occur
        if (files != null) {
          for (int i = 0; i < files.length; i++) {
            indexDocs(writer, new File(file, files[i]));
          }
        }
      } else {
        try {
          // only .txt files are indexed in this experiment
          if (file.getName().endsWith(".txt")) {
            System.out.println("adding " + file);
            writer.addDocument(FileDocument.Document(file));
          }
        }
        // at least on windows, some temporary files raise this
        // exception with an "access denied" message
        // checking if the file can be read doesn't help
        catch (FileNotFoundException fnfe) {
          ;
        }
      }
    }
  }
}
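[/code]
I ran this in the debugger with a breakpoint on writer.addDocument(FileDocument.Document(file)); and watched the index directory. No segment was created when the 5th document was added. The first segment appeared when the 8th document was added. I then repeated the experiment with the two values swapped (mergeFactor 8, minMergeDocs 5). This time the first segment appeared when the 5th document was added.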
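The same observation can be made without a debugger by listing the index directory and counting segments. The sketch below is only an illustration: SegmentLister and segmentNames are hypothetical names, and it assumes the usual layout in which every per-segment file starts with an underscore and shares its segment's name up to the first dot (for example _1.fnm and _1.frq), while index-wide files such as segments and deletable do not.

```java
import java.io.File;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: counts segments by grouping index files on their name prefix. */
final class SegmentLister {

    /**
     * Returns the distinct segment names found in a list of index file names.
     * Assumption: per-segment files look like "_<name>.<ext>"; other files are ignored.
     */
    static Set<String> segmentNames(String[] fileNames) {
        Set<String> names = new TreeSet<>();
        if (fileNames == null) {
            return names; // directory missing or unreadable
        }
        for (String fileName : fileNames) {
            int dot = fileName.indexOf('.');
            if (fileName.startsWith("_") && dot > 1) {
                names.add(fileName.substring(0, dot));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Defaults to the "index" directory used by the listing above.
        File indexDir = new File(args.length > 0 ? args[0] : "index");
        Set<String> names = segmentNames(indexDir.list());
        System.out.println(names.size() + " segment(s): " + names);
    }
}
```

Running it after each pause at the breakpoint shows when a new segment name first appears on disk.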
So my conclusion is that mergeFactor only controls how segments are merged. It does not control how many documents go into a newly created segment; that is controlled by minMergeDocs.
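To make that conclusion concrete, here is a toy model of the behaviour observed in the experiment. It is a simplified sketch, not Lucene's code: it assumes documents are buffered until minMergeDocs of them are pending, then written as one segment, and that the last mergeFactor segments of equal size are merged as long as the result does not exceed maxMergeDocs. SegmentSimulation and its method names are made up for this illustration, and the unverified case maxMergeDocs < minMergeDocs is deliberately rejected.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of segment creation and merging; a sketch under stated assumptions, not Lucene code. */
final class SegmentSimulation {

    /** Sizes of the on-disk segments after adding docsAdded documents, writer still open. */
    static List<Integer> segmentsWhileOpen(int docsAdded, int mergeFactor,
                                           int minMergeDocs, int maxMergeDocs) {
        return run(docsAdded, mergeFactor, minMergeDocs, maxMergeDocs, false);
    }

    /** Sizes of the on-disk segments after adding docsAdded documents and closing the writer. */
    static List<Integer> segmentsAfterClose(int docsAdded, int mergeFactor,
                                            int minMergeDocs, int maxMergeDocs) {
        return run(docsAdded, mergeFactor, minMergeDocs, maxMergeDocs, true);
    }

    private static List<Integer> run(int docsAdded, int mergeFactor, int minMergeDocs,
                                     int maxMergeDocs, boolean close) {
        if (mergeFactor < 2 || minMergeDocs < 1 || maxMergeDocs < minMergeDocs) {
            throw new IllegalArgumentException(
                "model covers only mergeFactor >= 2 and 1 <= minMergeDocs <= maxMergeDocs");
        }
        List<Integer> segments = new ArrayList<>();
        int buffered = 0;
        for (int doc = 1; doc <= docsAdded; doc++) {
            buffered++;
            if (buffered == minMergeDocs) {   // minMergeDocs decides when a segment is written
                segments.add(buffered);
                buffered = 0;
                mergeCascade(segments, mergeFactor, minMergeDocs, maxMergeDocs);
            }
        }
        if (close && buffered > 0) {
            segments.add(buffered);           // assumption: closing writes the leftover documents
        }
        return segments;
    }

    /** mergeFactor decides how many equal-sized segments are merged into one larger segment. */
    private static void mergeCascade(List<Integer> segments, int mergeFactor,
                                     int minMergeDocs, int maxMergeDocs) {
        long level = minMergeDocs;
        while (level * mergeFactor <= maxMergeDocs && tailIsFull(segments, mergeFactor, level)) {
            for (int k = 0; k < mergeFactor; k++) {
                segments.remove(segments.size() - 1);
            }
            level *= mergeFactor;
            segments.add((int) level);
        }
    }

    /** True if the last mergeFactor segments all hold exactly level documents. */
    private static boolean tailIsFull(List<Integer> segments, int mergeFactor, long level) {
        if (segments.size() < mergeFactor) {
            return false;
        }
        for (int k = 1; k <= mergeFactor; k++) {
            if (segments.get(segments.size() - k) != level) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(segmentsWhileOpen(7, 5, 8, 40));   // prints []
        System.out.println(segmentsWhileOpen(8, 5, 8, 40));   // prints [8]
        System.out.println(segmentsWhileOpen(5, 8, 5, 40));   // prints [5]
        System.out.println(segmentsAfterClose(81, 5, 8, 40)); // prints [40, 40, 1]
    }
}
```

With mergeFactor 5 and minMergeDocs 8 the model shows no segment after 7 documents and the first one at document 8. With the two values swapped, the first segment appears at document 5, which matches what the debugger showed.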
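I am also attaching an algorithm I wrote to compute the number of segments that will be produced. I wrote it in a hurry, so it may contain mistakes. One branch is unverified: when maxMergeDocs < minMergeDocs, my experiment produced a single segment, and I do not know why.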
[code]
package com.sina.easy.util;

public class CountSegmentNum {
    private int docNum = 0;
    private int mergefactor = 10;
    private int maxMergeDocs = Integer.MAX_VALUE;
    private int minMergeDocs = 10;
    private int segmentNum = 0;

    public CountSegmentNum(int docNum, int mergefactor, int maxMergeDocs,
            int minMergeDocs) {
        this.docNum = docNum;
        this.mergefactor = mergefactor;
        this.maxMergeDocs = maxMergeDocs;
        this.minMergeDocs = minMergeDocs;
    }

    public void countNum() {
        int i = 1;
        int tempmerfactormulti = mergefactor;
        while (true) {
            if (docNum == 0) {
                return;
            }
            // Fewer documents than minMergeDocs remain: they form one last segment.
            if (docNum < minMergeDocs) {
                segmentNum++;
                return;
            }
            if (maxMergeDocs >= docNum) {
                int x = docNum / minMergeDocs;
                int z = x % mergefactor;
                if (x >= mergefactor) {
                    segmentNum++;
                }
                segmentNum += z;
                docNum = docNum % minMergeDocs;
            } else {
                if (maxMergeDocs < minMergeDocs) {
                    segmentNum = 1;
                    // This branch has not been verified in detail, but nobody
                    // should use such settings in practice.
                    return;
                }
                if (maxMergeDocs < tempmerfactormulti * minMergeDocs) {
                    int nowmerfactor = tempmerfactormulti;
                    for (; i >= 1; i--) {
                        nowmerfactor = tempmerfactormulti / mergefactor;
                        segmentNum += docNum / (nowmerfactor * minMergeDocs);
                        docNum = docNum % (nowmerfactor * minMergeDocs);
                    }
                } else {
                    tempmerfactormulti = tempmerfactormulti * mergefactor;
                    i++;
                }
            }
        }
    }

    public int getSegmentNum() {
        return segmentNum;
    }

    public static void main(String[] args) {
        // Arguments: docNum, mergefactor, maxMergeDocs, minMergeDocs
        CountSegmentNum csn = new CountSegmentNum(81, 5, 60, 4);
        csn.countNum();
        System.out.println(csn.getSegmentNum());
    }
}
[/code]