python xapian存储结构
在项目中为了支持搜索服务,我们使用xapian作为后端的搜索引擎.其因性能良好以及易用受到大家欢迎.下面是基本代码:
import xapian import posixpath def get_db_path(): XAPIAN_ROOT = '/tmp/' xapian_user_database_path = posixpath.join(XAPIAN_ROOT, u'user_index') return xapian_user_database_path def add_document(database, words): doc = xapian.Document() for w in words: doc.add_term(w) database.add_document(doc) def build_index(): user_database = xapian.WritableDatabase(get_db_path(), xapian.DB_CREATE_OR_OPEN) words = ['a', 'b', 'c'] add_document(user_database, words) def search(words, offset=0, length=10): user_database = xapian.Database(get_db_path()) enquire = xapian.Enquire(user_database) query = xapian.Query(xapian.Query.OP_AND, words) enquire.set_query(query) return enquire.get_mset(int(offset), int(length)) def _show_q_results(matches): print '%i results found.' % matches.get_matches_estimated() print 'Results 1 - %i:' % matches.size() for match in matches: print '%i: %i%% docid=%i [%s]' % (match.rank + 1, match.percent, match.docid, match.document.get_data() ) if __name__ == '__main__': #index build_index() #search _show_q_results(search(['a','b']))
虽然使用起来很简单,但是我一直对于他的存储引擎有些好奇,所以看了一下最新的存储引擎brass的实现.下面是整个数据目录的层次结构:
/tmp/user_index
flintlock
iamchert
postlist.baseA
postlist.baseB
postlist.DB//存储所有term到docid的映射.
record.baseA
record.baseB
record.DB//存储所有docid到相应的数据的映射
termlist.baseA
termlist.baseB
termlist.DB//存储所有docid到相应的term的映射.
brass存储引擎采用的数据结构是BTree.所以上面每个*.DB都是存储一个BTree的.*.baseA/B则是存储相应的.DB的meta信息.包括这个大的数据文件有哪些数据块已经被使用,哪些空闲的bitmap,以及版本号等等相关信息.
BTree在xapian中表示为NLevel.每个level对应于BTree的一层.并且维护这一层的一个cursor.用于指向当前正在访问的某一个数据块,以及数据块中的某一个位置.其中每个数据块的数据结构如下:
from @brass_table.cc /* A B-tree comprises (a) a base file, containing essential information (Block size, number of the B-tree root block etc), (b) a bitmap, the Nth bit of the bitmap being set if the Nth block of the B-tree file is in use, and (c) a file DB containing the B-tree proper. The DB file is divided into a sequence of equal sized blocks, numbered 0, 1, 2 ... some of which are free, some in use. Those in use are arranged in a tree. Each block, b, has a structure like this: R L M T D o1 o2 o3 ... oN <gap> [item] .. [item] .. [item] ... <---------- D ----------> <-M-> And then, R = REVISION(b) is the revision number the B-tree had when the block was written into the DB file. L = GET_LEVEL(b) is the level of the block, which is the number of levels towards the root of the B-tree structure. So leaf blocks have level 0 and the one root block has the highest level equal to the number of levels in the B-tree. M = MAX_FREE(b) is the size of the gap between the end of the directory and the first item of data. (It is not necessarily the maximum size among the bits of space that are free, but I can't think of a better name.) T = TOTAL_FREE(b)is the total amount of free space left in b. D = DIR_END(b) gives the offset to the end of the directory. o1, o2 ... oN are a directory of offsets to the N items held in the block. The items are key-tag pairs, and as they occur in the directory are ordered by the keys. An item has this form: I K key x C tag <--K--> <------I------> A long tag presented through the API is split up into C tags small enough to be accommodated in the blocks of the B-tree. The key is extended to include a counter, x, which runs from 1 to C. The key is preceded by a length, K, and the whole item with a length, I, as depicted above.
上面来自于xapian的注释已经很清楚的说明了每个block的数据构成.除了数据元信息,就是由item组成的小的数据单元.其中每个小的item包括I(整个数据单元的长度),K(数据单元key的长度+x(key标示符)),C(表示对应的这个key有多少个item组成,因为某一个key对应的value太大的话,会进行value切分.所以C就表示总计有多少分.而之前的那个x则表示这个单元是第几份数据,这个如果需要读取这个key的整个大value就可以根据序号x进行合并.),tag就是我们刚才说的key对应的value,只不过xapian把它定义为tag.因为他是一个通用存储结构,所以这样定义也比较说的通.比如说在一颗BTree的非叶子节点tag存储的是下一层数据块的地址.对于叶子节点来说则存储相关的数据.现在整个树的存储结构已经清晰的展示出来了.
这里面有一个问题比较有意思的是postlist的存储,设想某一个热点词包含有很多很多的docid,比如说有100w个.那么当我们进行增量更新的时候,想要把某个docid从这个term删除掉,那么怎么才能尽快查找到这个docid在哪个数据块中呢?作者采用了term+docid作为BTree的key的方式来解决这个问题.value则是所有的大于这个docid的docid.并且每个块设定一个大小.这样就能让我们能尽快的定位一个docid在哪个block中,而不用读取所有的block然后再去查找了.
xapian还支持多个reader,单线程writer的方式进行增量更新.采用的类似数据库中的MVCC的方式,这样就不会因为更新把读操作阻塞住了.
目前作者正在开发replication方式,可以支持增量更新到其他机器.这样就能做到数据可靠(不会应为单机磁盘损坏导致数据丢失)以及高可用性(单机不可用,应用层可以切换到备用机器上)了.
BTW:我这两天看了xapiandevel的邮件列表,虽然没有提交问题,但是看了作者(OllyBetts)对于每个问题都会给出耐心又详尽的答复,他人真的是很好.很是佩服.