htmlcleaner usage example.

Original source: http://blog.chenlb.com/2008/11/htmlcleaner-use-demo.html

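When programming, the data source is sometimes HTML, so the HTML has to be parsed to extract the data. Fortunately, the Java community has several libraries for parsing HTML. After trying and comparing them, I personally find htmlcleaner easier to use than htmlparser; htmlcleaner's XPath support is especially handy. It may also be that I am simply not familiar enough with htmlparser.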

htmlcleaner download: htmlcleaner2_1.jar. Source download: htmlcleaner2_1-all.zip

Write an HTML file for testing: html-clean-demo.html
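The contents of that test file are not reproduced here. As a stand-in, the sketch below writes a small hypothetical page to html/html-clean-demo.html so the demo has something to parse. The markup is an assumption, chosen only to match what the demo code queries (a title tag, li elements inside a div with class d_1, and a elements with name my_href); it is not the author's original file. The class name WriteDemoHtml and the example.com URLs are likewise made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WriteDemoHtml {

    // Hypothetical test page: NOT the original html-clean-demo.html,
    // just enough markup for the demo's three queries to find something.
    static final String HTML = String.join("\n",
        "<html>",
        "<head><title>html clean demo</title></head>",
        "<body>",
        "<div class=\"d_1\">",
        "  <ul>",
        "    <li>item one</li>",
        "    <li>item two</li>",
        "  </ul>",
        "</div>",
        "<a name=\"my_href\" href=\"http://example.com/a\">link a</a>",
        "<a name=\"my_href\" href=\"http://example.com/b\">link b</a>",
        "</body>",
        "</html>");

    // Writes the page encoded as GBK, the charset the demo passes to clean().
    static Path write(Path target) throws IOException {
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent); // make sure html/ exists
        }
        return Files.write(target, HTML.getBytes(Charset.forName("GBK")));
    }

    public static void main(String[] args) throws IOException {
        Path written = write(Paths.get("html", "html-clean-demo.html"));
        System.out.println("wrote " + written);
    }
}
```

Run it once from the project directory before running the demo; the relative path matches the one the demo opens.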

package com.chenlb;

import java.io.File;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * htmlcleaner usage example.
 *
 * @author chenlb 2008-11-26 02:12:02 PM
 */
public class HtmlCleanerDemo {

	public static void main(String[] args) throws Exception {
		HtmlCleaner cleaner = new HtmlCleaner();

		TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");
		// Fetch by tag name.
		Object[] ns = node.getElementsByName("title", true);	// the page title

		if(ns.length > 0) {
			System.out.println("title="+((TagNode)ns[0]).getText());
		}
		System.out.println("ul/li:");
		// Fetch by XPath.
		ns = node.evaluateXPath("//div[@class='d_1']//li");
		for(Object on : ns) {
			TagNode n = (TagNode) on;
			System.out.println("\ttext="+n.getText());
		}
		System.out.println("a:");
		// Fetch by attribute value.
		ns = node.getElementsByAttValue("name", "my_href", true, true);
		for(Object on : ns) {
			TagNode n = (TagNode) on;
			System.out.println("\thref="+n.getAttributeByName("href")+", text="+n.getText());
		}
	}
}

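The argument to cleaner.clean() can be a file, a URL, or a string of HTML content. In my view, the most commonly used methods are evaluateXPath, getElementsByAttValue, and getElementsByName. One more note: htmlcleaner copes fairly well with non-standard (malformed) HTML.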
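As a rough illustration of the string form of clean(), here is a minimal sketch that parses an in-memory page and applies the same three lookups. The sample markup, the class name StringInputDemo, and the helper method names are all assumptions made up for this example. getText() is passed through toString() because its return type has differed between htmlcleaner releases, and the methods declare throws Exception because the checked exceptions of clean() have also varied by version.

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class StringInputDemo {

    // Hypothetical sample markup, kept in memory instead of in a file.
    static final String HTML =
        "<html><head><title>demo page</title></head><body>"
        + "<div class='d_1'><ul><li>first</li><li>second</li></ul></div>"
        + "<a name='my_href' href='http://example.com/'>a link</a>"
        + "</body></html>";

    // By tag name: text of the first <title>, or null if there is none.
    static String title(TagNode root) {
        TagNode[] titles = root.getElementsByName("title", true);
        return titles.length > 0 ? titles[0].getText().toString().trim() : null;
    }

    // By XPath: number of <li> elements under div.d_1.
    static int countListItems(TagNode root) throws Exception {
        return root.evaluateXPath("//div[@class='d_1']//li").length;
    }

    // By attribute value: href of the first element with name="my_href".
    static String firstHref(TagNode root) {
        TagNode[] links = root.getElementsByAttValue("name", "my_href", true, true);
        return links.length > 0 ? links[0].getAttributeByName("href") : null;
    }

    public static void main(String[] args) throws Exception {
        // clean(String) parses HTML content directly; no file or URL involved.
        TagNode root = new HtmlCleaner().clean(HTML);
        System.out.println("title=" + title(root));
        System.out.println("li count=" + countListItems(root));
        System.out.println("href=" + firstHref(root));
    }
}
```

Compiling and running this needs the htmlcleaner jar on the classpath, just like the demo above.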
