HtmlCleaner usage example.

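When programming, the data source is sometimes an HTML page, so the HTML has to be parsed to extract the data. Fortunately, the Java community has several good libraries for parsing HTML. Having tried and compared them, I personally find HtmlCleaner easier to use than HTMLParser; its XPath support is especially handy. It may also just be that I am less familiar with HTMLParser.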

HtmlCleaner download: htmlcleaner2_1.jar. Source download: htmlcleaner2_1-all.zip

Write an HTML file for testing: html-clean-demo.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=GBK" />
    <meta http-equiv="Content-Language" content="zh-CN" />
    <title>html clean demo</title>
</head>
<body>
    <div class="d_1">
        <ul>
            <li>bar</li>
            <li>foo</li>
            <li>gzz</li>
        </ul>
    </div>
    <div>
        <ul>
            <li><a name="my_href" href="1.html">text-1</a></li>
            <li><a name="my_href" href="2.html">text-2</a></li>
            <li><a name="my_href" href="3.html">text-3</a></li>
            <li><a name="my_href" href="4.html">text-4</a></li>
        </ul>
    </div>
</body>
</html>

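Mock requirement: extract the title, the links with name="my_href", and the content of every li under the div with class="d_1". The code below does this with HtmlCleaner, in HtmlCleanerDemo.java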

package com.chenlb;

import java.io.File;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * HtmlCleaner usage example.
 *
 * @author chenlb 2008-11-26 02:12:02 PM
 */
public class HtmlCleanerDemo {

    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();

        // Parse the test file, reading it as GBK to match its declared charset.
        TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

        // Select by tag name.
        Object[] ns = node.getElementsByName("title", true); // title
        if (ns.length > 0) {
            System.out.println("title=" + ((TagNode) ns[0]).getText());
        }

        System.out.println("ul/li:");
        // Select by XPath.
        ns = node.evaluateXPath("//div[@class='d_1']//li");
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\ttext=" + n.getText());
        }

        System.out.println("a:");
        // Select by attribute value (recursive, case-sensitive).
        ns = node.getElementsByAttValue("name", "my_href", true, true);
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\thref=" + n.getAttributeByName("href") + ",text=" + n.getText());
        }
    }
}

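The argument to cleaner.clean() can be a file, a URL, or a string of HTML content. In my view, the most commonly used methods are evaluateXPath, getElementsByAttValue, and getElementsByName. Also worth noting: HtmlCleaner copes well with malformed HTML.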
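To illustrate the string form of clean(), here is a minimal sketch that parses HTML held in memory instead of a file and collects the link targets with evaluateXPath. It assumes the htmlcleaner jar is on the classpath. The class name HtmlCleanerStringDemo and the helper extractHrefs are made up for this illustration, and the XPath //a[@name='my_href'] is assumed to work the same way as the attribute predicate used in the demo above.

```java
import java.util.ArrayList;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

// Hypothetical demo class, not part of the original post.
public class HtmlCleanerStringDemo {

    /**
     * Cleans an in-memory HTML string and returns the href attribute
     * of every <a> element whose name attribute is "my_href".
     */
    static List<String> extractHrefs(String html) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();
        // clean(String) parses HTML content directly, no file needed.
        TagNode root = cleaner.clean(html);

        // Same XPath style as the file-based demo, applied to <a> elements.
        Object[] links = root.evaluateXPath("//a[@name='my_href']");

        List<String> hrefs = new ArrayList<String>();
        for (Object o : links) {
            hrefs.add(((TagNode) o).getAttributeByName("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String html = "<div><ul>"
                + "<li><a name=\"my_href\" href=\"1.html\">text-1</a></li>"
                + "<li><a name=\"my_href\" href=\"2.html\">text-2</a></li>"
                + "</ul></div>";
        for (String href : extractHrefs(html)) {
            System.out.println("href=" + href);
        }
    }
}
```

The URL form should be usable the same way by passing a java.net.URL to clean(); it is left out here so the sketch runs without network access.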
