Search Engines: Lucene Caching

ReganHoo

2011-10-18


Lucene's Caching Mechanisms and a Caching Solution

Overview

1. Filter Cache

2. Field cache

3. Conclusion

4. Caching solution for LuceneBase

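Overview

Lucene's caches fall into two categories: the filter cache and the field cache.

The filter cache is implemented by CachingWrapperFilter, which caches the results of another Filter.

The field cache is implemented by FieldCache, which caches the values of the fields used for sorting.

In short, the filter cache serves queries and the field cache serves sorting.

Both caches live only as long as a single IndexReader instance, so the key to improving Lucene query performance is to keep and reuse the same IndexReader (that is, the same IndexSearcher).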

Filter Cache

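Strictly speaking, Lucene has no query cache comparable to the data cache of a database server. Lucene's filter cache is implemented by CachingWrapperFilter, which caches the bits (the BitSet of matching documents) produced by the wrapped filter. Lucene also provides FilterManager, a singleton that caches the Filter objects themselves.

Here is the implementation of CachingWrapperFilter: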

public class CachingWrapperFilter extends Filter {

    protected Filter filter;

    protected transient Map cache; // the map that serves as the cache

    public CachingWrapperFilter(Filter filter) {
        this.filter = filter;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        if (cache == null) {
            // a WeakHashMap, so the JVM can reclaim an entry once its reader is no longer referenced
            cache = new WeakHashMap();
        }

        synchronized (cache) { // check cache
            // the key is the IndexReader and the value is the BitSet,
            // so a cached entry lives only as long as its IndexReader
            BitSet cached = (BitSet) cache.get(reader);
            if (cached != null) {
                return cached;
            }
        }

        // nothing cached for this reader: compute the bits again
        final BitSet bits = filter.bits(reader);

        synchronized (cache) { // update cache
            cache.put(reader, bits);
        }

        return bits;
    }
}
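To make the per-reader lifetime concrete, here is a minimal, self-contained sketch of the same idea without any Lucene dependency. PerReaderCache, its computations() counter, and the plain Object "readers" are hypothetical stand-ins introduced only for illustration; this is a simplified model, not Lucene's class.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Simplified model of a per-reader result cache; not Lucene's actual class. */
class PerReaderCache<R> {
    private final Function<R, BitSet> compute;                // stands in for the wrapped Filter
    private final Map<R, BitSet> cache = new WeakHashMap<>(); // an entry can vanish once its reader is unreachable
    private int computations = 0;                             // counts cache misses, for illustration only

    PerReaderCache(Function<R, BitSet> compute) {
        this.compute = compute;
    }

    /** Returns the cached bits for this reader, computing them on the first request. */
    synchronized BitSet bits(R reader) {
        BitSet cached = cache.get(reader);
        if (cached == null) {
            cached = compute.apply(reader);
            computations++;
            cache.put(reader, cached);
        }
        return cached;
    }

    synchronized int computations() {
        return computations;
    }
}

class PerReaderCacheDemo {
    public static void main(String[] args) {
        PerReaderCache<Object> cache = new PerReaderCache<>(reader -> {
            BitSet bits = new BitSet(8);
            bits.set(1);
            bits.set(5);
            return bits;
        });
        Object readerA = new Object(); // hypothetical stand-in for an IndexReader
        Object readerB = new Object();
        cache.bits(readerA);
        cache.bits(readerA);           // served from the cache
        cache.bits(readerB);           // a different reader forces a recomputation
        System.out.println("computations = " + cache.computations());
    }
}
```

In this sketch the second call with readerA is a cache hit, while readerB triggers a new computation. That mirrors why opening a new IndexReader for every query defeats the filter cache.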

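FilterManager uses Filter.hashCode() as the cache key, so a custom Filter class should override hashCode().

Example: Filter filter=FilterManager.getInstance().getFilter(new CachingWrapperFilter(new MyFilter())); If an equivalent filter is already registered, FilterManager returns the cached Filter (which already carries its bit cache); otherwise it returns the filter that was passed in (which has no bit cache yet).

FilterManager runs a background thread that periodically cleans the cache to guard against out-of-memory errors.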
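The sketch below models why the hashCode() override matters for a registry keyed this way. TermFilterKey and FilterRegistry are hypothetical names, and eviction by idle time is an assumption of this sketch rather than a description of FilterManager's actual cleanup policy.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical filter with value-based identity, so equal filters share one cache slot. */
class TermFilterKey {
    final String field;
    final String value;

    TermFilterKey(String field, String value) {
        this.field = field;
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TermFilterKey)) return false;
        TermFilterKey other = (TermFilterKey) o;
        return field.equals(other.field) && value.equals(other.value);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, value);
    }
}

/** Simplified model of a filter registry keyed by hashCode(); not Lucene's FilterManager. */
class FilterRegistry<F> {
    private static final class Entry<E> {
        final E filter;
        long lastUsed;
        Entry(E filter, long now) { this.filter = filter; this.lastUsed = now; }
    }

    private final Map<Integer, Entry<F>> cache = new HashMap<>();

    /** Returns the filter already registered under the same hash code, or registers this one. */
    synchronized F getFilter(F filter, long now) {
        Entry<F> entry = cache.get(filter.hashCode());
        if (entry == null) {
            entry = new Entry<>(filter, now);
            cache.put(filter.hashCode(), entry);
        }
        entry.lastUsed = now;
        return entry.filter;
    }

    /** Drops entries idle for longer than maxIdleMillis; a cleaner thread would call this periodically. */
    synchronized void evictIdle(long now, long maxIdleMillis) {
        cache.values().removeIf(e -> now - e.lastUsed > maxIdleMillis);
    }

    synchronized int size() {
        return cache.size();
    }
}

class FilterRegistryDemo {
    public static void main(String[] args) {
        FilterRegistry<TermFilterKey> registry = new FilterRegistry<>();
        TermFilterKey first = new TermFilterKey("type", "news");
        TermFilterKey second = new TermFilterKey("type", "news"); // equal content, new object
        System.out.println(registry.getFilter(first, 0L) == first);   // first registration
        System.out.println(registry.getFilter(second, 10L) == first); // same hash code, cached object returned
        registry.evictIdle(1_000L, 500L);
        System.out.println(registry.size());
    }
}
```

Without the hashCode() override, the two TermFilterKey objects would normally hash differently, each lookup would register a new entry, and nothing would ever be reused.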

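Field cache

The field cache is used for sorting. Lucene reads every field that is sorted on into memory, and the memory this takes grows with the number of documents. People often run out of memory when sorting with Lucene, usually because each query creates a new searcher instance. Under high concurrency, several Searcher instances then load the sort fields at the same time and exhaust the heap.

The field cache is implemented by FieldCacheImpl. Let's see how sorting uses it.

Every sorted query ends up calling this method of the IndexSearcher class: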

public TopFieldDocs search(Weight weight, Filter filter, final int nDocs, Sort sort) throws IOException {
    // TopFieldDocCollector does the sorting
    TopFieldDocCollector collector = new TopFieldDocCollector(reader, sort, nDocs);
    // run the query; hits are sorted as each one is passed to the collector's collect() callback
    search(weight, filter, collector);
    // TopFieldDocs differs from TopDocs in that it also carries the sort fields
    // (names and values); its ScoreDoc[] actually holds FieldDoc instances
    return (TopFieldDocs) collector.topDocs();
}

Next, here is how TopFieldDocCollector.collect() is implemented:

public void collect(int doc, float score) {
    if (score > 0.0f) {
        totalHits++;
        if (reusableFD == null)
            reusableFD = new FieldDoc(doc, score);
        else {
            reusableFD.score = score;
            reusableFD.doc = doc;
        }
        // hq is a FieldSortedHitQueue, a subclass of PriorityQueue; insertWithOverflow()
        // maintains a fixed-size sorted queue and pushes out the lowest-ranked entry
        reusableFD = (FieldDoc) hq.insertWithOverflow(reusableFD);
    }
}

FieldSortedHitQueue implements the ordering by overriding lessThan():

protected boolean lessThan (final Object a, final Object b) {
    final ScoreDoc docA = (ScoreDoc) a;
    final ScoreDoc docB = (ScoreDoc) b;

    // run comparators
    final int n = comparators.length;
    int c = 0;
    for (int i=0; i<n && c==0; ++i) {
        // the comparators[] do the actual ordering; what remains is to see
        // how they are built and how they use the FieldCache
        c = (fields[i].reverse) ? comparators[i].compare (docB, docA)
                                : comparators[i].compare (docA, docB);
    }

    // avoid random sort order that could lead to duplicates (bug #31241):
    if (c == 0)
        return docA.doc > docB.doc;
    return c > 0;
}
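The fixed-size queue with a doc-id tie-break can be sketched on top of java.util.PriorityQueue. TopNQueue and Hit are hypothetical names for illustration; Lucene's own PriorityQueue is a separate class with a different implementation, so this only models the behavior described above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Simplified model of a fixed-size sorted hit queue; not Lucene's FieldSortedHitQueue. */
class TopNQueue<T> {
    private final int capacity;
    private final Comparator<T> bestFirst; // orders better hits before worse ones
    private final PriorityQueue<T> heap;   // head is the worst hit currently kept

    TopNQueue(int capacity, Comparator<T> bestFirst) {
        this.capacity = capacity;
        this.bestFirst = bestFirst;
        this.heap = new PriorityQueue<>(capacity, bestFirst.reversed());
    }

    /** Adds a hit; once the queue is full, the worst hit (old or new) is dropped and returned. */
    T insertWithOverflow(T hit) {
        if (heap.size() < capacity) {
            heap.add(hit);
            return null;
        }
        T worst = heap.peek();
        if (bestFirst.compare(hit, worst) < 0) { // the new hit beats the current worst
            heap.poll();
            heap.add(hit);
            return worst;
        }
        return hit;
    }

    /** Returns the kept hits, best first. */
    List<T> topHits() {
        List<T> hits = new ArrayList<>(heap);
        hits.sort(bestFirst);
        return hits;
    }
}

class TopNQueueDemo {
    /** Hypothetical hit: a doc id plus the rank of its sort-field value. */
    static final class Hit {
        final int doc;
        final int rank;
        Hit(int doc, int rank) { this.doc = doc; this.rank = rank; }
    }

    public static void main(String[] args) {
        // Lower rank sorts first; ties fall back to the doc id so the order is deterministic.
        Comparator<Hit> bestFirst =
                Comparator.comparingInt((Hit h) -> h.rank).thenComparingInt(h -> h.doc);
        TopNQueue<Hit> queue = new TopNQueue<>(3, bestFirst);
        int[][] hits = {{0, 5}, {1, 2}, {2, 9}, {3, 2}, {4, 1}};
        for (int[] h : hits) {
            queue.insertWithOverflow(new Hit(h[0], h[1]));
        }
        for (Hit h : queue.topHits()) {
            System.out.println("doc " + h.doc + " rank " + h.rank);
        }
    }
}
```

With a capacity of 3, the demo keeps docs 4, 1 and 3 in that order: docs 1 and 3 share rank 2, and the doc-id fallback puts doc 1 first, the same role the final comparison in lessThan() plays.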

The comparators are created in the FieldSortedHitQueue constructor:

public FieldSortedHitQueue (IndexReader reader, SortField[] fields, int size) throws IOException {
    final int n = fields.length;
    comparators = new ScoreDocComparator[n];
    this.fields = new SortField[n];
    for (int i=0; i<n; ++i) {
        String fieldname = fields[i].getField();
        // getCachedComparator returns the cached comparator, a ScoreDocComparator instance
        comparators[i] = getCachedComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
        if (comparators[i].sortType() == SortField.STRING) {
            this.fields[i] = new SortField (fieldname, fields[i].getLocale(), fields[i].getReverse());
        } else {
            this.fields[i] = new SortField (fieldname, comparators[i].sortType(), fields[i].getReverse());
        }
    }
    initialize (size);
}

Here is the implementation of getCachedComparator():

static final FieldCacheImpl.Cache Comparators = new FieldCacheImpl.Cache() {
    ...
};

static ScoreDocComparator getCachedComparator (IndexReader reader, String field, int type, Locale locale, SortComparatorSource factory) throws IOException {
    // these two sort types do not need to read any field
    if (type == SortField.DOC) return ScoreDocComparator.INDEXORDER;   // sort by index order
    if (type == SortField.SCORE) return ScoreDocComparator.RELEVANCE;  // sort by relevance

    FieldCacheImpl.Entry entry = (factory != null) ? new FieldCacheImpl.Entry (field, factory)
                                                   : new FieldCacheImpl.Entry (field, type, locale);

    // every other sort type has to load the field into the cache;
    // Comparators is the FieldCacheImpl.Cache instance declared above
    return (ScoreDocComparator) Comparators.get(reader, entry);
}

Comparators.get() returns a different ScoreDocComparator implementation depending on the type of the sort field. The String implementation below shows where the FieldCache is called:

static ScoreDocComparator comparatorString (final IndexReader reader, final String fieldname)
throws IOException {
    final String field = fieldname.intern();

    // The call below reads the cache to get the mapping between field values and
    // document ids; if nothing is cached yet, it reads the index files. The cache
    // lives as long as the IndexReader, so when different queries share the same
    // Searcher only one copy of the sort cache exists and memory does not run out.
    final FieldCache.StringIndex index = FieldCache.DEFAULT.getStringIndex (reader, field);

    return new ScoreDocComparator () {

        public final int compare (final ScoreDoc i, final ScoreDoc j) {
            // index.order[] holds each document's position in the field's sort order and
            // is indexed by Lucene doc id; see getStringIndex for how the values are loaded
            final int fi = index.order[i.doc];
            final int fj = index.order[j.doc];
            if (fi < fj) return -1;
            if (fi > fj) return 1;
            return 0;
        }

        public Comparable sortValue (final ScoreDoc i) {
            return index.lookup[index.order[i.doc]];
        }

        public int sortType() {
            return SortField.STRING;
        }
    };
}
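The order[] and lookup[] arrays are easier to picture with a small stand-alone model. StringOrdinalIndex is a hypothetical class that builds both arrays from an in-memory array of per-document values; reserving slot 0 for documents without a value is an assumption of this sketch, and the real getStringIndex reads the terms from the index instead.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Simplified model of a string sort index: order[doc] is the position of the doc's term in lookup[]. */
class StringOrdinalIndex {
    final String[] lookup; // distinct terms in sorted order; slot 0 is null, for docs without a value
    final int[] order;     // indexed by doc id, one int per document

    StringOrdinalIndex(String[] valuePerDoc) {
        TreeSet<String> terms = new TreeSet<>();
        for (String v : valuePerDoc) {
            if (v != null) terms.add(v);
        }
        lookup = new String[terms.size() + 1];
        int next = 1;
        for (String t : terms) {
            lookup[next++] = t;
        }
        order = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            String v = valuePerDoc[doc];
            // binarySearch over the sorted range [1, lookup.length) returns the term's slot
            order[doc] = (v == null) ? 0 : Arrays.binarySearch(lookup, 1, lookup.length, v);
        }
    }

    /** Compares two documents by comparing small integers instead of strings. */
    int compare(int docA, int docB) {
        return Integer.compare(order[docA], order[docB]);
    }

    /** Returns the original field value of a document. */
    String sortValue(int doc) {
        return lookup[order[doc]];
    }
}

class StringOrdinalIndexDemo {
    public static void main(String[] args) {
        String[] values = {"pear", "apple", null, "pear", "fig"}; // value of the sort field per doc id
        StringOrdinalIndex index = new StringOrdinalIndex(values);
        System.out.println(Arrays.toString(index.lookup));
        System.out.println(Arrays.toString(index.order));
        System.out.println(index.compare(1, 4)); // "apple" vs "fig"
    }
}
```

Because order[] has one entry per document, the memory cost scales with the document count, which is why loading it once per shared reader rather than once per query matters.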

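Conclusion

The two caching mechanisms above already solve the great majority of problems in Lucene. Solr, which wraps Lucene, adds further caches of its own, but they arguably help little and make the code considerably more complex.

Caching solution

Lucene's caches live only as long as a single IndexReader instance, so the key to improving Lucene query performance is to keep and reuse the same IndexReader (that is, the same IndexSearcher).

We therefore need a new class, SingleIndexSearcher (source below), which extends IndexSearcher and turns the IndexSearcher into a singleton.

Add the SingleIndexSearcher class to LuceneBase and obtain every IndexSearcher object through SingleIndexSearcher.getInstance().

Using a cached filter: Filter filter = new CachingWrapperFilter(new FieldFilter(field, value));

or

Filter filter = FilterManager.getInstance().getFilter(new CachingWrapperFilter(new FieldFilter(field, value)));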

/**
 * Singleton IndexSearcher. The singleton makes full use of Lucene's caches and
 * keeps multiple IndexSearcher objects from causing out-of-memory errors and
 * concurrency problems.
 *
 * @author 路卫杰
 * @version 1.0, 2010-8-4
 * @see IndexSearcher
 */
public class SingleIndexSearcher extends IndexSearcher {

    /** The private static SingleIndexSearcher instance */
    private static IndexSearcher instance;

    static {
        try {
            instance = new SingleIndexSearcher(Configure.getProperties().getProperty("ZkAnalyzerPath"));
            System.out.println("constructed");
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Constructor.
     *
     * @param path the index path
     * @throws IOException
     * @throws CorruptIndexException
     */
    public SingleIndexSearcher(String path) throws CorruptIndexException, IOException {
        super(path);
    }

    /**
     * Returns the singleton.
     */
    public static IndexSearcher getInstance() {
        return instance;
    }
}
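One weakness of a static initializer that only prints the exception is that getInstance() can silently return null if the index fails to open. As a hedged alternative, the sketch below shows a generic lazy holder that builds the shared object once and fails loudly if the factory yields nothing. SharedInstanceHolder is a hypothetical helper, not part of Lucene or LuceneBase, and the factory stands in for whatever code opens the searcher.

```java
import java.util.function.Supplier;

/** Hypothetical holder that builds one expensive, searcher-like object and shares it. */
class SharedInstanceHolder<T> {
    private final Supplier<T> factory; // stands in for "open the IndexSearcher"
    private volatile T instance;

    SharedInstanceHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the shared instance, creating it on first use (double-checked locking). */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get();
                    if (local == null) {
                        // fail loudly instead of handing out null later
                        throw new IllegalStateException("factory returned no instance");
                    }
                    instance = local;
                }
            }
        }
        return local;
    }
}

class SharedInstanceHolderDemo {
    public static void main(String[] args) {
        int[] built = {0}; // counts how often the factory runs
        SharedInstanceHolder<Object> holder = new SharedInstanceHolder<>(() -> {
            built[0]++;
            return new Object(); // hypothetical stand-in for an IndexSearcher
        });
        Object first = holder.get();
        Object second = holder.get();
        System.out.println("same instance: " + (first == second));
        System.out.println("factory calls: " + built[0]);
    }
}
```

Every caller receives the same object, so any cache keyed by the underlying reader is filled once and then reused across queries.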

Search speed comparison

Searches with the same keyword and filter   Plain filter (ms)   Cached filter (ms)   Cached sort (ms)
1                                           2407                2438                 2093
5                                           4750                2531                 2219
10                                          8110                2672                 2313
20                                          14750               2922                 2593
50                                          34498               3672                 3250
100                                         67546               4844                 4407
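To read the table as a per-search cost, one can take the difference between the 100-search and 1-search rows and divide by the 99 additional searches. The sketch below does that arithmetic with the figures from the table; SearchTimings and marginalMs are hypothetical names, and treating the first search as a fixed warm-up cost is an assumption of this reading.

```java
/** Arithmetic on the timing table above; the class and method names are illustrative. */
class SearchTimings {
    static final int[] RUNS = {1, 5, 10, 20, 50, 100};
    static final int[] PLAIN_MS = {2407, 4750, 8110, 14750, 34498, 67546};
    static final int[] CACHED_FILTER_MS = {2438, 2531, 2672, 2922, 3672, 4844};
    static final int[] CACHED_SORT_MS = {2093, 2219, 2313, 2593, 3250, 4407};

    /** Average extra time per additional search between the first and the last row. */
    static double marginalMs(int[] totals) {
        int last = totals.length - 1;
        return (double) (totals[last] - totals[0]) / (RUNS[last] - RUNS[0]);
    }

    public static void main(String[] args) {
        System.out.printf("plain filter:  %.2f ms per extra search%n", marginalMs(PLAIN_MS));
        System.out.printf("cached filter: %.2f ms per extra search%n", marginalMs(CACHED_FILTER_MS));
        System.out.printf("cached sort:   %.2f ms per extra search%n", marginalMs(CACHED_SORT_MS));
    }
}
```

By this reading each repeated search costs roughly 658 ms with the plain filter and roughly 24 ms with the cached filter, while the first search costs about the same in every column.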
