我正在尝试使用jsoup阅读大量的html页面。我有一个名为“allPageLinks”的arraylist,它保留了html页面链接。这是我的代码:
Document doc;
for (int i = 0; i < allPageLinks.size(); i++) {
try {
doc = Jsoup.connect(allPageLinks.get(i)).timeout(0).get();
Element page_clips = doc.getElementById("page_clips");
Element page_clip_content = page_clips
.getElementById("content");
Elements product_grid = page_clip_content
.select(".product-list.margin-left-5");
Elements products = product_grid.get(0).children();
for (int j = 0; j < products.size(); j++) {
try {
String productName = products.get(j)
.getElementsByClass("name").text();
String productPrice = products.get(j)
.getElementsByClass("price").text();
String productLink = products.get(j)
.getElementsByClass("image").select("a")
.first().attr("href");
Document newDoc = Jsoup.connect(productLink).get();
Elements elements = newDoc.getElementsByClass("left");
Elements productNameElement = elements.get(0)
.getElementsByClass("colorbox");
String productImage = productNameElement.attr("href");
elements = newDoc.getElementsByClass("right");
String productId = elements.get(0)
.getElementsByClass("field").get(1).text();
writer.append(productName);
writer.append(';');
writer.append(productPrice);
writer.append(';');
writer.append(productId);
writer.append(';');
writer.append(productImage);
writer.append(';');
writer.append(productLink);
writer.append('\n');
} catch (Exception ex) {
System.out.println(ex.getMessage() + " " + i + " "
+ allPageLinks.get(i) + " ICTEKICATCH");
}
}
} catch (Exception ex) {
System.out.println(ex.getMessage() + " " + i + " "
+ allPageLinks.get(i));
}
}
即使我将连接超时设置为零,我也会为大多数链接获得大量的连接超时异常。任何人都可以帮助我摆脱这种例外吗?
由于
答案 0 :(得分:0)
您忘记在代码循环中添加指定此连接的超时:
Document newDoc = Jsoup.connect(productLink).get();
应该是:
Document newDoc = Jsoup.connect(productLink).timeout(0).get();
这是最有可能发生超时异常的地方。