htmlcleaner 使用示例.
原文出处:http://blog.chenlb.com/2008/11/htmlcleaner-use-demo.html
<!-- google_ad_section_start -->编程的时候,有时数据源从html来。那就要对html分析提取数据。好在java社区里有好有相关库来解析html,经使用比较:个人觉得 htmlcleaner 比 htmlparser 好用。htmlcleaner 的 xpath特好用。也可能我对htmlparser不熟悉。
htmlcleaner 下载地址:htmlcleaner2_1.jar 源码下载:htmlcleaner2_1-all.zip
写一个测试用的html文件:html-clean-demo.html
package
com.chenlb;
import
java.io.File;
import
org.htmlcleaner.HtmlCleaner;
import
org.htmlcleaner.TagNode;
/**
* htmlcleaner 使用示例.
*
* @author chenlb 2008-11-26 下午02:12:02
*/
public
class
HtmlCleanerDemo {
public
static
void
main(String[] args)
throws
Exception {
HtmlCleaner cleaner = new
HtmlCleaner();
TagNode node = cleaner.clean(new
File(
"html/html-clean-demo.html"
),
"GBK"
);
//按tag取.
Object[] ns = node.getElementsByName("title"
,
true
);
//标题
if
(ns.length >
0
) {
System.out.println("title="
+((TagNode)ns[
0
]).getText());
}
System.out.println("ul/li:"
);
//按xpath取
ns = node.evaluateXPath("//div[@class='d_1']//li"
);
for
(Object on : ns) {
TagNode n = (TagNode) on;
System.out.println("\ttext="
+n.getText());
}
System.out.println("a:"
);
//按属性值取
ns = node.getElementsByAttValue("name"
,
"my_href"
,
true
,
true
);
for
(Object on : ns) {
TagNode n = (TagNode) on;
System.out.println("\thref="
+n.getAttributeByName(
"href"
)+
", text="
+n.getText());
}
}
} package com.chenlb;
import java.io.File;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
/**
* htmlcleaner 使用示例.
*
* @author chenlb 2008-11-26 下午02:12:02
*/
public class HtmlCleanerDemo {
public static void main(String[] args) throws Exception {
HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");
//按tag取.
Object[] ns = node.getElementsByName("title", true); //标题
if(ns.length > 0) {
System.out.println("title="+((TagNode)ns[0]).getText());
}
System.out.println("ul/li:");
//按xpath取
ns = node.evaluateXPath("//div[@class='d_1']//li");
for(Object on : ns) {
TagNode n = (TagNode) on;
System.out.println("\ttext="+n.getText());
}
System.out.println("a:");
//按属性值取
ns = node.getElementsByAttValue("name", "my_href", true, true);
for(Object on : ns) {
TagNode n = (TagNode) on;
System.out.println("\thref="+n.getAttributeByName("href")+", text="+n.getText());
}
}
}cleaner.clean()中的参数,可以是文件,可以是url,可以是字符串内容。个人认为:比较常用的应该是evaluateXPath、getElementsByAttValue、getElementsByName方法了。另外说明下,htmlcleaner 对不规范的html兼容性比较好。
相关推荐
lupeng 2020-11-14
sjcheck 2020-11-10
nercon 2020-08-09
pythonclass 2020-07-29
玫瑰小妖 2020-07-18
WebVincent 2020-07-09
lyg0 2020-07-05
WebVincent 2020-06-16
huzijia 2020-06-16
qsdnet我想学编程 2020-06-13
pythonclass 2020-06-06
nercon 2020-06-06
gufudhn 2020-06-06
STPace 2020-06-04
HSdiana 2020-06-03
haocxy 2020-05-31
行吟阁 2020-05-30
haocxy 2020-05-28