htmlcleaner usage example.
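When programming, the data source is sometimes HTML, so the HTML has to be parsed and the data extracted from it. Fortunately the Java community has several libraries for parsing HTML. Having tried and compared them, I personally find htmlcleaner easier to use than htmlparser, and htmlcleaner's XPath support is especially handy. It may also simply be that I am not familiar enough with htmlparser.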
htmlcleaner download: htmlcleaner2_1.jar. Source download: htmlcleaner2_1-all.zip
Write an HTML file for testing: html-clean-demo.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
<head>
    <meta http-equiv="Content-Type" content="text/html;charset=GBK" />
    <meta http-equiv="Content-Language" content="zh-CN" />
    <title>html clean demo</title>
</head>
<body>
<div class="d_1">
    <ul>
        <li>bar</li>
        <li>foo</li>
        <li>gzz</li>
    </ul>
</div>
<div>
    <ul>
        <li><a name="my_href" href="1.html">text-1</a></li>
        <li><a name="my_href" href="2.html">text-2</a></li>
        <li><a name="my_href" href="3.html">text-3</a></li>
        <li><a name="my_href" href="4.html">text-4</a></li>
    </ul>
</div>
</body>
</html>
Mock requirement: extract the title, the links with name="my_href", and the content of every li under the div with class="d_1". The code below does this with htmlcleaner: HtmlCleanerDemo.java
package com.chenlb;

import java.io.File;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * htmlcleaner usage example.
 *
 * @author chenlb 2008-11-26 02:12:02 PM
 */
public class HtmlCleanerDemo {

    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();
        // Parse the test file, decoding it as GBK.
        TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

        // Select by tag name (true = search recursively).
        Object[] ns = node.getElementsByName("title", true); // the title
        if (ns.length > 0) {
            System.out.println("title=" + ((TagNode) ns[0]).getText());
        }

        System.out.println("ul/li:");
        // Select by XPath.
        ns = node.evaluateXPath("//div[@class='d_1']//li");
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\ttext=" + n.getText());
        }

        System.out.println("a:");
        // Select by attribute value.
        ns = node.getElementsByAttValue("name", "my_href", true, true);
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\thref=" + n.getAttributeByName("href") + ",text=" + n.getText());
        }
    }
}
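The argument to cleaner.clean() can be a file, a URL, or a string of HTML content. In my opinion the most commonly used methods are evaluateXPath, getElementsByAttValue, and getElementsByName. One more note: htmlcleaner copes fairly well with malformed HTML.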
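To make clearer what the XPath expression in the demo selects, here is a minimal sketch that runs the same kind of queries with the JDK's built-in javax.xml.xpath API instead of htmlcleaner. This is a swapped-in technique, not htmlcleaner's API: the class name XPathSketch, the helper select, and the trimmed PAGE string (no DOCTYPE, no namespace, fewer links) are all hypothetical names introduced for illustration. The JDK parser only accepts well-formed markup, so this works on the cleaned-up fragment but would not cope with the messy HTML that htmlcleaner is meant for. Whether htmlcleaner's evaluateXPath accepts every expression shown here (for example the attribute step /@href) is an assumption to check against the library version in use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical, trimmed copy of the test page: well-formed, no DOCTYPE, no namespace.
    public static final String PAGE =
        "<html><head><title>demo</title></head><body>"
        + "<div class='d_1'><ul><li>bar</li><li>foo</li><li>gzz</li></ul></div>"
        + "<div><ul>"
        + "<li><a name='my_href' href='1.html'>text-1</a></li>"
        + "<li><a name='my_href' href='2.html'>text-2</a></li>"
        + "</ul></div></body></html>";

    /** Evaluates an XPath expression and returns the text content of each matched node. */
    public static List<String> select(String xml, String expr) throws Exception {
        // Parse the string into a DOM tree (requires well-formed markup).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // For an element this is its text; for an attribute node, its value.
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(select(PAGE, "//title"));                    // [demo]
        System.out.println(select(PAGE, "//div[@class='d_1']//li"));    // [bar, foo, gzz]
        System.out.println(select(PAGE, "//a[@name='my_href']/@href")); // [1.html, 2.html]
    }
}
```

The second query mirrors the one in HtmlCleanerDemo: the leading // finds the div at any depth, the predicate [@class='d_1'] keeps only the div with that class, and the inner //li collects every li below it, which is why the links in the second div are not matched.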