HttpClient cannot fetch content from some websites when no User-Agent is set
log4j:WARN No appenders could be found for logger (org.apache.commons.httpclient.HttpClient).
log4j:WARN Please initialize the log4j system properly.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<style type="text/css">
.clearfix:after {
content: ".";
display: block;
height: 0;
clear: both;
visibility: hidden;
}
.clearfix {
display:block;
}
.left {
float: left;
}
h1 {font-size: 20px;color: #6293BB;}
p {font-size: 14px;color: #6293BB;}
</style>
</head>
<body>
<div style="padding:50px 0 0 300px">
<h1>Your access request was denied</h1>
</div>
<div class="clearfix">
<div class="left" style="padding-left:120px">
<img src="/images/filenotfound.jpg" width="128" height="128" />
</div>
<div class="left" style="width:700px;padding:30px 0 0 30px">
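<p>You may be using a web crawler to scrape pages from the ITeye website!</p>
<p>ITeye does not allow web crawlers to scrape its pages maliciously. Please stop this scraping immediately!</p>
<p>If your crawler is not scraping maliciously and you would like ITeye to permit it, please contact the ITeye administrator to obtain authorization: webmaster<img src='/images/email.gif' alt="Email" />iteye.com</p>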
</div>
</div>
<div style="padding:20px 0 0 500px">
</div>
</body>
</html>
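import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

HttpClient httpClient = new HttpClient();
GetMethod getMethod = new GetMethod("http://www.iteye.com/");
/**
 * Set the User-Agent. If it is not set, the site refuses to serve the page content.
 */
// Firefox-style value, kept as an alternative (not used below)
String USER_AGENT = "Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.1.2) Gecko/20090803 Fedora/3.5.2-2.fc11 Firefox/3.5.2";
// Internet Explorer-style value, the one actually sent
String User_Agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB5; .NET CLR 1.1.4322; .NET CLR 2.0.50727; Alexa Toolbar; MAXTHON 2.0)";
// Apply the User-Agent as a client-level parameter
httpClient.getParams().setParameter(HttpMethodParams.USER_AGENT, User_Agent);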
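The effect can be reproduced without hitting the real site. The following is a minimal sketch, not the code from this post: it swaps Commons HttpClient for the JDK's built-in `java.net.HttpURLConnection`, and it uses a local `com.sun.net.httpserver.HttpServer` as a hypothetical stand-in for a site that rejects clients whose User-Agent does not look like a browser. The class name `UserAgentDemo`, the helpers `startServer` and `fetchStatus`, and the "starts with `Mozilla/`" rule are all assumptions made for illustration; ITeye's actual filtering rule is not known.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class UserAgentDemo {

    // Hypothetical stand-in for a site that refuses non-browser clients.
    // Assumption: "browser-like" means the User-Agent starts with "Mozilla/".
    static HttpServer startServer() throws IOException {
        // Port 0 asks the OS for any free port on the loopback interface.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean browserLike = ua != null && ua.startsWith("Mozilla/");
            byte[] body = (browserLike ? "welcome" : "access denied")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(browserLike ? 200 : 403, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Sends a GET request and returns the HTTP status code.
    // A null userAgent leaves the JDK default ("Java/<version>") in place.
    static int fetchStatus(String url, String userAgent) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        if (userAgent != null) {
            conn.setRequestProperty("User-Agent", userAgent);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer();
        String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
        String browserUa = "Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.1.2) "
                + "Gecko/20090803 Fedora/3.5.2-2.fc11 Firefox/3.5.2";
        System.out.println("default User-Agent: " + fetchStatus(url, null));
        System.out.println("browser User-Agent: " + fetchStatus(url, browserUa));
        server.stop(0);
    }
}
```

Against this stand-in server, the request with the default User-Agent is rejected with 403 and the one carrying the browser string gets 200, which mirrors the refusal page shown above. A real site may apply different or additional checks, so a browser-like User-Agent is not guaranteed to be enough.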