A New Generation of Clustering Search Engines
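Today's search engines, including Baidu, Google, Soso, and Yahoo, all offer general-purpose search. Imagine how nice it would be if the results were grouped into categories automatically. A search for "Ajax", for example, would be sorted into categories like those in the figure below: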
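An open-source project of this kind already exists: Carrot2. It is very simple to use, but because clustering Chinese text differs considerably from clustering English text, most of the effort goes into the Chinese side of the clustering algorithm. Carrot2's official site: http://project.carrot2.org/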
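One likely source of that difference is word segmentation: English words are separated by spaces, while Chinese text is not, so a clustering algorithm first needs some way to cut a Chinese sentence into terms. The sketch below only illustrates the problem. It is not Carrot2 code; the class and method names are hypothetical, and overlapping character bigrams are a naive stand-in for a real Chinese segmenter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Carrot2.
class TokenizeSketch {

    // English-style tokenization: split on runs of whitespace.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Naive stand-in for Chinese segmentation: overlapping two-character
    // windows. Assumes every character fits in one Java char (no surrogate pairs).
    static List<String> characterBigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Whitespace splitting works for English...
        System.out.println(whitespaceTokens("data mining tools"));
        // ...but returns the whole Chinese phrase as a single token.
        System.out.println(whitespaceTokens("数据挖掘"));
        // Bigrams at least yield sub-phrase terms that documents can share.
        System.out.println(characterBigrams("数据挖掘"));
    }
}
```

Bigrams produce noisy terms that straddle word boundaries, which is one reason a dictionary-based or statistical segmenter would normally be preferred in practice.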
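Search engines are gradually segmenting the market. A number of vertical search engines already exist, such as Google's local-life search (生活搜索), alongside "human flesh search" (crowd-powered search, which at heart is also a study of how people are related). Search products are indeed slowly becoming more human-centered in their design.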
The source code of a document-clustering example that ships with Carrot2 is as follows:
try {
    /*
     * Initialize local controller. Normally you'd run this only once
     * for an entire application (controller is thread safe).
     */
    final LocalController controller = initLocalController();

    /*
     * Once we have a controller we can run queries. Change the query
     * to something that is relevant to the data in your index.
     */

    // Data for clustering. Each entry holds a document title and,
    // standing in for the document body, its URL.
    String[][] documents = new String[][] {
        { "Data Mining - Wikipedia", "http://en.wikipedia.org/wiki/Data_mining" },
        { "KD Nuggets", "http://www.kdnuggets.com/" },
        { "The Data Mine", "http://www.the-data-mine.com/" },
        { "DMG", "http://www.dmg.org/" },
        { "Data Mining", "http://www.gr-fx.com/graf-fx.htm" },
        { "Data Mining Benchmarking Association (DMBA)", "http://www.dmbenchmarking.com/" },
        { "Data Mining", "http://www.computerworld.com/databasetopics/businessintelligence/datamining" },
        { "National Center for Data Mining (NCDM) - University of Illinois at Chicago", "http://www.ncdm.uic.edu/" },
    };

    // Although the query will not be used to fetch any data, if the data
    // that you're submitting for clustering is a response to some
    // search engine-like query, please provide it, as the clustering
    // algorithm may use it to improve the clustering quality.
    final String query = "data mining";

    // The documents are provided for clustering in the
    // PARAM_SOURCE_RAW_DOCUMENTS parameter, which should point to
    // a List of RawDocuments.
    List documentList = new ArrayList(documents.length);
    for (int i = 0; i < documents.length; i++) {
        documentList.add(new RawDocumentSnippet(
            new Integer(i),   // unique id of the document, can be a plain sequence id
            documents[i][0],  // document title
            documents[i][1],  // document body
            "dummy://" + i,   // URL (not required for clustering)
            0.0f)             // document score, can be 0.0
        );
    }

    final HashMap params = new HashMap();
    params.put(ArrayInputComponent.PARAM_SOURCE_RAW_DOCUMENTS, documentList);

    final ProcessingResult pResult = controller.query("direct-feed-lingo", query, params);
    final ArrayOutputComponent.Result result =
        (ArrayOutputComponent.Result) pResult.getQueryResult();

    /*
     * Once we have the buffered snippets and clusters, we can display
     * them somehow. We'll reuse the simple text-dumping method
     * available in {@link Example}.
     */
    Example.displayResults(result);
} catch (Exception e) {
    // There shouldn't be any, but just in case.
    System.err.println("An exception occurred: " + e.toString());
    e.printStackTrace();
}
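To make the idea of labelled result groups concrete without any Carrot2 dependency, here is a toy sketch. It is not the Lingo algorithm that the "direct-feed-lingo" process above runs: it simply groups titles under any non-query word that at least two titles share. The class name, the method name, and the sample "Ajax" titles are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy grouping, not part of Carrot2 and not Lingo.
class KeywordGroupingSketch {

    // Groups titles under every non-query word that occurs in two or more titles.
    static Map<String, List<String>> group(String query, List<String> titles) {
        Set<String> queryWords =
            new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        Map<String, List<String>> byWord = new LinkedHashMap<>();

        for (String title : titles) {
            // Track words already seen so a title is listed once per label.
            Set<String> seen = new HashSet<>();
            // Split on anything that is not a letter or a digit.
            for (String word : title.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (word.isEmpty() || queryWords.contains(word) || !seen.add(word)) {
                    continue;
                }
                byWord.computeIfAbsent(word, k -> new ArrayList<>()).add(title);
            }
        }

        // Keep only labels shared by at least two titles.
        byWord.values().removeIf(docs -> docs.size() < 2);
        return byWord;
    }

    public static void main(String[] args) {
        // Made-up sample titles for the query "Ajax".
        List<String> titles = Arrays.asList(
            "Ajax Amsterdam football club",
            "Ajax football tickets",
            "Ajax JavaScript tutorial",
            "JavaScript Ajax request example");

        group("Ajax", titles).forEach(
            (label, docs) -> System.out.println(label + " -> " + docs));
    }
}
```

On the sample titles this yields two groups, labelled "football" and "javascript". A real clustering algorithm has to do much more, such as choosing readable phrase labels and handling titles that share no exact words.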