python xapian存储结构

kobeyan

2013-02-25

在项目中为了支持搜索服务,我们使用xapian作为后端的搜索引擎.其因性能良好以及易用受到大家欢迎.下面是基本代码:

import xapian
import posixpath

def get_db_path():
    XAPIAN_ROOT = '/tmp/'
    xapian_user_database_path = posixpath.join(XAPIAN_ROOT, u'user_index')
    return xapian_user_database_path

def add_document(database, words):
    doc = xapian.Document()
    for w in words:
        doc.add_term(w)
    database.add_document(doc)

def build_index():
    user_database = xapian.WritableDatabase(get_db_path(), xapian.DB_CREATE_OR_OPEN)
    words = ['a', 'b', 'c']
    add_document(user_database, words)

def search(words, offset=0, length=10):
    user_database = xapian.Database(get_db_path())
    enquire = xapian.Enquire(user_database)
    query = xapian.Query(xapian.Query.OP_AND, words)
    enquire.set_query(query)
    return enquire.get_mset(int(offset), int(length))

def _show_q_results(matches):
    print '%i results found.' % matches.get_matches_estimated()
    print 'Results 1 - %i:' % matches.size()
    for match in matches:
        print '%i: %i%% docid=%i [%s]' % (match.rank + 1,
                                          match.percent,
                                          match.docid,
                                          match.document.get_data()
                                          )
if __name__ == '__main__':
    #index 
    build_index()
    
    #search
    _show_q_results(search(['a','b']))

虽然使用起来很简单,但是我一直对于他的存储引擎有些好奇,所以看了一下最新的存储引擎brass的实现.下面是整个数据目录的层次结构:

/tmp/user_index

flintlock

iamchert

postlist.baseA

postlist.baseB

postlist.DB//存储所有term到docid的映射.

record.baseA

record.baseB

record.DB//存储所有docid到相应的数据的映射

termlist.baseA

termlist.baseB

termlist.DB//存储所有docid到相应的term的映射.

brass存储引擎采用的数据结构是BTree.所以上面每个*.DB都是存储一个BTree的.*.baseA/B则是存储相应的.DB的meta信息.包括这个大的数据文件有哪些数据块已经被使用,哪些空闲的bitmap,以及版本号等等相关信息.

BTree在xapian中表示为NLevel.每个level对应于BTree的一层.并且维护这一层的一个cursor.用于指向当前正在访问的某一个数据块,以及数据块中的某一个位置.其中每个数据块的数据结构如下:

from @brass_table.cc
/* A B-tree comprises (a) a base file, containing essential information (Block
   size, number of the B-tree root block etc), (b) a bitmap, the Nth bit of the
   bitmap being set if the Nth block of the B-tree file is in use, and (c) a
   file DB containing the B-tree proper. The DB file is divided into a sequence
   of equal sized blocks, numbered 0, 1, 2 ... some of which are free, some in
   use. Those in use are arranged in a tree.

   Each block, b, has a structure like this:

     R L M T D o1 o2 o3 ... oN <gap> [item] .. [item] .. [item] ...
     <---------- D ----------> <-M->

   And then,

   R = REVISION(b)  is the revision number the B-tree had when the block was
                    written into the DB file.
   L = GET_LEVEL(b) is the level of the block, which is the number of levels
                    towards the root of the B-tree structure. So leaf blocks
                    have level 0 and the one root block has the highest level
                    equal to the number of levels in the B-tree.
   M = MAX_FREE(b)  is the size of the gap between the end of the directory and
                    the first item of data. (It is not necessarily the maximum
                    size among the bits of space that are free, but I can't
                    think of a better name.)
   T = TOTAL_FREE(b)is the total amount of free space left in b.
   D = DIR_END(b)   gives the offset to the end of the directory.

   o1, o2 ... oN are a directory of offsets to the N items held in the block.
   The items are key-tag pairs, and as they occur in the directory are ordered
   by the keys.

   An item has this form:

           I K key x C tag
             <--K-->
           <------I------>

   A long tag presented through the API is split up into C tags small enough to
   be accommodated in the blocks of the B-tree. The key is extended to include
   a counter, x, which runs from 1 to C. The key is preceded by a length, K,
   and the whole item with a length, I, as depicted above.

上面来自于xapian的注释已经很清楚的说明了每个block的数据构成.除了数据元信息,就是由item组成的小的数据单元.其中每个小的item包括I(整个数据单元的长度),K(数据单元key的长度+x(key标示符)),C(表示对应的这个key有多少个item组成,因为某一个key对应的value太大的话,会进行value切分.所以C就表示总计有多少分.而之前的那个x则表示这个单元是第几份数据,这个如果需要读取这个key的整个大value就可以根据序号x进行合并.),tag就是我们刚才说的key对应的value,只不过xapian把它定义为tag.因为他是一个通用存储结构,所以这样定义也比较说的通.比如说在一颗BTree的非叶子节点tag存储的是下一层数据块的地址.对于叶子节点来说则存储相关的数据.现在整个树的存储结构已经清晰的展示出来了.

这里面有一个问题比较有意思的是postlist的存储,设想某一个热点词包含有很多很多的docid,比如说有100w个.那么当我们进行增量更新的时候,想要把某个docid从这个term删除掉,那么怎么才能尽快查找到这个docid在哪个数据块中呢？作者采用了term+docid作为BTree的key的方式来解决这个问题.value则是所有的大于这个docid的docid.并且每个块设定一个大小.这样就能让我们能尽快的定位一个docid在哪个block中,而不用读取所有的block然后再去查找了.

xapian还支持多个reader,单线程writer的方式进行增量更新.采用的类似数据库中的MVCC的方式,这样就不会因为更新把读操作阻塞住了.

目前作者正在开发replication方式,可以支持增量更新到其他机器.这样就能做到数据可靠(不会应为单机磁盘损坏导致数据丢失)以及高可用性(单机不可用,应用层可以切换到备用机器上)了.

BTW:我这两天看了xapiandevel的邮件列表,虽然没有提交问题,但是看了作者(OllyBetts)对于每个问题都会给出耐心又详尽的答复,他人真的是很好.很是佩服.

python

安科网

python xapian存储结构

kobeyan

kobeyan

相关推荐

python 发送get请求接口详解

python 使用tkinter+you-get实现视频下载器

python中requests模拟登录的三种方式(携带cookie/session进行请求网站)

python开发一个解析protobuf文件的简单编译器

python 下载文件的多种方法汇总

Linux Shell 如何获取参数的方法

python跨文件使用全局变量的实现

Python爬虫破解登陆哔哩哔哩的方法

python调用百度API实现人脸识别

Python调用ffmpeg开源视频处理库，批量处理视频

详解python os.path.exists判断文件或文件夹是否存在

python实现在列表中查找某个元素的下标示例

python如何获得list或numpy数组中最大元素对应的索引

Python实现列表索引批量删除的5种方法

python 爬虫如何实现百度翻译

致命错误！Python开发者的7个崩溃瞬间

针对Python开发人员的10个“疯狂”的项目构想

用Python内置模块处理ini配置文件

VS Code 中 Python 扩展的部分功能重构，支持 R 和 Julia

Python五个隐藏的特性，你可能从未听说过

kobeyan