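I am using the Jsoup API 1.8.3 to parse all the links present on a website generated with PHP. The home page, the contact form page, and others are parsed successfully. For the login page, however, it fails with:
HTTP error fetching URL. Status=404, https://.../info/en/loginMf.php?src=trading
It fails because the page requires valid credentials, so I want to skip such URLs. I tried to do this by checking the status code as follows: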
Connection.Response response=(Response) Jsoup.connect(path);//Added typecast
System.out.println(response.statusCode());
But the added typecast throws an error at runtime: ClassCastException.
What is the correct way to get the status code of a URL before passing it to the parse() method?
Edit
I tried to adopt the answer given by @lonesome here, as follows:
try
{
Connection.Response response= Jsoup.connect(path).execute();
int statusCode=response.statusCode();
if (statusCode >= 200 && statusCode < 300) {
doc = Jsoup.connect(filename).get();//web crawling
}
}
catch(HttpStatusException http)
{
System.out.println("Status:"+http.getStatusCode());
http.printStackTrace();
}
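But the problem is that the line int statusCode=response.statusCode(); is never executed. This is probably because of the way jsoup works. The request has to be executed to get a response, as @lucksch's answer says.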
Answer 0: (score: 2)
Try this:
URL url = new URL("adr"); // replace "adr" with the address to check
URLConnection connection = url.openConnection();
if (connection instanceof HttpURLConnection) {
    // cast only after the instanceof check, and before asking for the status code
    HttpURLConnection httpConn = (HttpURLConnection) connection;
    try {
        int statusCode = httpConn.getResponseCode();
        if (statusCode >= 200 && statusCode < 300) {
            // means the connection was successful
            // do crawling
        }
    } catch (ConnectException ex) { // catch the possible exception
        java.util.logging.Logger.getLogger(crawler.class.getName()).log(Level.SEVERE, null, ex);
    } catch (SSLHandshakeException | SocketException | SocketTimeoutException | UnknownHostException ex) {
        java.util.logging.Logger.getLogger(crawler.class.getName()).log(Level.SEVERE, null, ex);
    }
    // any other IOException still has to be caught or declared by the enclosing method
}
// replace crawler with the name of your program's main class
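For reference, here is a self-contained sketch of the same check using only the JDK. The class name StatusProbe, the helper names fetchStatus and isSuccess, and the 5-second timeouts are illustrative assumptions, not part of the original snippet. It relies on getResponseCode() returning the code for error responses such as 404 rather than throwing.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper class: the names and timeout values are assumptions.
public class StatusProbe {

    /** Returns the HTTP status code for the address, or -1 if it is not an HTTP(S) URL. */
    public static int fetchStatus(String address) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        if (!(connection instanceof HttpURLConnection)) {
            return -1;
        }
        HttpURLConnection httpConn = (HttpURLConnection) connection;
        httpConn.setConnectTimeout(5000); // milliseconds
        httpConn.setReadTimeout(5000);
        try {
            // Sends the request and reads the status line; a 404 is returned, not thrown.
            return httpConn.getResponseCode();
        } finally {
            httpConn.disconnect();
        }
    }

    /** True for any 2xx status code. */
    public static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    public static void main(String[] args) {
        String address = args.length > 0 ? args[0] : "http://en.wikipedia.org/blah";
        try {
            int status = fetchStatus(address);
            System.out.println(address + " -> " + status + (isSuccess(status) ? " (crawl)" : " (skip)"));
        } catch (IOException e) {
            // Covers connection failures, timeouts, unknown hosts, TLS errors.
            System.out.println(address + " could not be reached: " + e);
        }
    }
}
```

Keep in mind that this makes one request just to read the status, so crawling the page afterwards with Jsoup means a second request to the same URL.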
Answer 1: (score: 1)
You only get a response once you actually make a request to the site in question. So this is how you get it:
Connection.Response response= Jsoup.connect(path).execute();
The execute method returns a Connection.Response, which contains the status code.
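One caveat: by default execute() throws an HttpStatusException for error responses such as the 404 in the question, so the statusCode() line is never reached. A minimal sketch of one way around this is to set ignoreHttpErrors(true) on the connection and then reuse the same response for parsing. The class name FetchIfOk and the helpers shouldParse and fetchIfOk are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class: FetchIfOk, shouldParse and fetchIfOk are illustrative names.
public class FetchIfOk {

    /** True for any 2xx status code. */
    public static boolean shouldParse(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    /** Returns the parsed page, or null when the server answered with a non-2xx status. */
    public static Document fetchIfOk(String path) throws IOException {
        Connection.Response response = Jsoup.connect(path)
                .ignoreHttpErrors(true) // report 4xx/5xx through statusCode() instead of throwing
                .execute();
        int statusCode = response.statusCode();
        if (!shouldParse(statusCode)) {
            System.out.println("Skipping " + path + " (status " + statusCode + ")");
            return null;
        }
        // Parses the body that execute() already fetched; no second request is made.
        return response.parse();
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "http://en.wikipedia.org/blah";
        Document doc = fetchIfOk(path);
        System.out.println(doc == null ? "not crawled" : "title: " + doc.title());
    }
}
```

Network-level failures (timeouts, unknown hosts) still surface as an IOException, so the caller has to handle those separately from the status check.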
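Answer 2: (score: 0)
JSoup throws an HttpStatusException when a non-OK HTTP response is returned. Here is a demo program that shows how to validate URLs properly with JSoup. I built the list of URLs by hand; you will of course already have that list from somewhere.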
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
public class JSoupMain
{
public static void main(String[] args)
{
List<String> allUrls = new ArrayList<String>();
allUrls.add("http://en.wikipedia.org");
allUrls.add("http://en.wikipedia.org/blah"); //<---This will cause a 404 status code to be returned
allUrls.add("http://mvnrepository.com/artifact/org.jsoup/jsoup/1.8.3");
System.out.println("Checking urls");
List<String> goodUrls = getGoodUrls(allUrls);
System.out.println("\r\nGood urls");
for(String url : goodUrls)
{
System.out.println(url);
}
}
private static List<String> getGoodUrls(List<String> allUrls)
{
List<String> goodUrls = new ArrayList<String>();
for(String url : allUrls)
{
try
{
Jsoup.connect(url).get();
goodUrls.add(url);
}
catch(HttpStatusException e)
{
System.out.println("Url " + url + " resulted in " + e.getStatusCode());
}
catch(IOException e)
{
e.printStackTrace();
}
}
return goodUrls;
}
}