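A First Look at the Sphinx Search Engine
Sphinx is a full-text search engine developed by the Russian programmer Andrew Aksyonoff.
1. Trying it out
Download the source and install it on Linux, following the steps in http://sphinxsearch.com/docs/1.10/quick-tour.html.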
Install:
$ wget http://sphinxsearch.com/downloads/sphinx-1.10-beta.tar.gz
$ tar xfv sphinx-1.10-beta.tar.gz
$ cd sphinx-1.10-beta
$ ./configure --prefix=/usr/local/sphinx
$ make && make install
Set up the configuration file by copying sphinx.conf.dist to sphinx.conf. For the example bundled with Sphinx, sphinx.conf needs no changes:
$ cd /usr/local/sphinx/etc
$ cp sphinx.conf.dist sphinx.conf
$ vi sphinx.conf
Run the database script (a MySQL data source is used):
$ mysql -u test < /usr/local/sphinx/etc/example.sql
Build the indexes:
$ cd /usr/local/sphinx/etc
$ ../bin/indexer --all
Search for the keyword test:
$ ../bin/search test
To query in client/server mode, first start the search daemon:
$ ../bin/searchd
Go to the API directory in the source tree:
$ cd sphinx-1.10-beta/api
Query with the PHP client:
$ php test.php test
Query with the Java client, after running mk.cmd to compile the code:
$ java -jar sphinxapi.jar test
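2. Updating and deleting index entries
In a real application, rebuilding the index over all the data every time is rarely feasible. The documentation describes delta index updates and index merging: periodically index only the newly added data (idx_delta), then merge it into the existing index (idx_main).
Command: indexer --merge idx_main idx_delta
With the --merge-dst-range option and an attribute value, the merge can also drop entries from the original index (idx_main) whose attribute value does not pass the filter.
Command: indexer --merge <dst-index> <src-index> --merge-dst-range <attr> <min> <max>
Example: indexer --merge idx_main idx_delta --merge-dst-range deleted 0 0
Here deleted is an attribute, and only entries of the original index whose value lies between 0 and 0 are kept.
Attribute values in the original index can be changed through the API call UpdateAttributes.
Newer versions of Sphinx also have real-time indexes, which I have not looked at yet.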
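To make the effect of --merge-dst-range concrete, here is a toy model in plain Java. It is only a sketch and does not use Sphinx: each index is reduced to a map from document ID to the value of deleted, and the class and method names are made up for illustration. It also assumes that when a document ID exists in both indexes, the version from the source (delta) index wins.

```java
import java.util.Map;
import java.util.TreeMap;

final class MergeSketch {
    // Toy model: an index maps document ID -> value of the "deleted" attribute.
    static Map<Long, Long> merge(Map<Long, Long> dst, Map<Long, Long> src,
                                 long min, long max) {
        Map<Long, Long> merged = new TreeMap<>();
        // The range filter is applied to the destination index only.
        for (Map.Entry<Long, Long> e : dst.entrySet()) {
            long value = e.getValue();
            if (value >= min && value <= max) {
                merged.put(e.getKey(), value);
            }
        }
        // Assumption: on an ID collision the source (delta) entry replaces
        // the destination entry.
        merged.putAll(src);
        return merged;
    }

    public static void main(String[] args) {
        Map<Long, Long> idxMain = new TreeMap<>();
        idxMain.put(1L, 0L);
        idxMain.put(2L, 1L); // flagged as deleted
        idxMain.put(3L, 0L);
        Map<Long, Long> idxDelta = new TreeMap<>();
        idxDelta.put(4L, 0L);
        // Same idea as: indexer --merge idx_main idx_delta --merge-dst-range deleted 0 0
        System.out.println(merge(idxMain, idxDelta, 0, 0).keySet()); // [1, 3, 4]
    }
}
```

Document 2 disappears from the merged result because its deleted value of 1 falls outside the range 0 to 0.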
# Configuration for index merging (modified from the bundled example)
# mysql
create table sph_counter(
counter_id int primary key not null,
max_doc_id int not null,
max_delta_id int not null
);
# add a deleted column; a value of 1 marks the row as deleted
alter table documents add deleted int default 0;
#sphinx.conf
source src_main
{
# record the highest document ID
sql_query_pre = replace into sph_counter select 1, max(id), 0 from documents;
# all records (assuming nothing is written while the query runs)
sql_query = SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content, deleted FROM documents WHERE deleted = 0 and id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
# not shown in the original listing, assumed to be required: declare deleted as an
# attribute so that --merge-dst-range and UpdateAttributes can work on it
sql_attr_uint = deleted
}
source src_delta : src_main
{
# record the highest ID among the newly added records
sql_query_pre = replace into sph_counter select 1, (select max_doc_id from sph_counter) as max_doc_id, (select max(id) from documents) as max_delta_id from dual;
# newly added records only
sql_query = SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content, deleted FROM documents WHERE deleted = 0 and id <= (SELECT max_delta_id FROM sph_counter WHERE counter_id=1) and id > (SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
# move max_delta_id into max_doc_id
sql_query_post = update sph_counter set max_doc_id = max_delta_id where counter_id = 1 and max_delta_id > max_doc_id;
}
index idx_main
{
source = src_main
path = /usr/local/sphinx/var/data/test_main
}
index idx_delta : idx_main
{
source = src_delta
path = /usr/local/sphinx/var/data/test_delta
}
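The bookkeeping that the two sources do with sph_counter can be followed in a small Java sketch. It only mimics the ID windows and does not talk to MySQL or Sphinx. The class and method names are hypothetical, the deleted = 0 filter is left out, and document IDs are assumed to be positive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CounterWindowSketch {
    long maxDocId = 0;   // mirrors sph_counter.max_doc_id
    long maxDeltaId = 0; // mirrors sph_counter.max_delta_id

    // src_main: the pre-query stores max(id); the main query takes id <= max_doc_id.
    List<Long> indexMain(List<Long> ids) {
        maxDocId = max(ids);
        maxDeltaId = 0;
        return select(ids, 0, maxDocId); // IDs assumed positive, so 0 is a safe lower bound
    }

    // src_delta: the pre-query stores max(id) as max_delta_id; the main query
    // takes max_doc_id < id <= max_delta_id; the post-query advances max_doc_id.
    List<Long> indexDelta(List<Long> ids) {
        maxDeltaId = max(ids);
        List<Long> picked = select(ids, maxDocId, maxDeltaId);
        if (maxDeltaId > maxDocId) {
            maxDocId = maxDeltaId;
        }
        return picked;
    }

    private static long max(List<Long> ids) {
        long m = 0;
        for (long id : ids) {
            m = Math.max(m, id);
        }
        return m;
    }

    private static List<Long> select(List<Long> ids, long lowExclusive, long highInclusive) {
        List<Long> out = new ArrayList<>();
        for (long id : ids) {
            if (id > lowExclusive && id <= highInclusive) {
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        CounterWindowSketch counter = new CounterWindowSketch();
        List<Long> ids = new ArrayList<>(Arrays.asList(1L, 2L, 3L));
        System.out.println(counter.indexMain(ids));  // [1, 2, 3]
        ids.add(4L);
        ids.add(5L);
        System.out.println(counter.indexDelta(ids)); // [4, 5]
        ids.add(6L);
        System.out.println(counter.indexDelta(ids)); // [6]
    }
}
```

Because the post-query advances max_doc_id after every delta run, the second delta run no longer contains records 4 and 5. With this configuration, each delta index therefore seems to need merging into idx_main before the next delta run, or those records are lost from the search results.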
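Update the index through the PHP API, setting deleted to 1 for record 2:
$cl->UpdateAttributes ( "idx_main", array("deleted"), array(2=>array(1)) );
The same update with the Java API:
String[] attrs = new String[1];
attrs[0] = "deleted";
// each row holds a document ID followed by the new attribute values
long[][] values = new long[1][2];
values[0][0] = 2; // document ID
values[0][1] = 1; // new value of deleted
int res = cl.UpdateAttributes ( "idx_main", attrs, values );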
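Filling the values array by hand gets tedious with more than one document. The helper below is a possible convenience, not part of the Sphinx API: it turns a map shaped like the PHP argument (document ID to new attribute values) into the row layout used in the Java call above. The names UpdateRowsSketch and buildRows are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class UpdateRowsSketch {
    // Builds rows of the form { docId, value1, ..., valueN }, one per document.
    static long[][] buildRows(Map<Long, long[]> updates, int attrCount) {
        long[][] rows = new long[updates.size()][];
        int i = 0;
        for (Map.Entry<Long, long[]> e : updates.entrySet()) {
            long[] newValues = e.getValue();
            if (newValues.length != attrCount) {
                throw new IllegalArgumentException(
                    "document " + e.getKey() + " needs " + attrCount + " values");
            }
            long[] row = new long[attrCount + 1];
            row[0] = e.getKey();                                // document ID first
            System.arraycopy(newValues, 0, row, 1, attrCount);  // then the values
            rows[i++] = row;
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Long, long[]> updates = new LinkedHashMap<>();
        updates.put(2L, new long[] {1L}); // set deleted = 1 for document 2
        System.out.println(Arrays.deepToString(buildRows(updates, 1))); // [[2, 1]]
    }
}
```

The resulting array could then be passed as the third argument of cl.UpdateAttributes, with attrs listing the attribute names in the same order as the values.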
3. Chinese word segmentation
Two options: Coreseek, an open-source Chinese search engine, and Sphinx-for-chinese.
References: