The page is: http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm
I want to use Jsoup to extract all <tr class="tr_normal"> elements.
The code I am using is:
Document doc = Jsoup.connect(url).get();
Elements es = doc.getElementsByClass("tr_normal");
System.out.println(es.size());
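However, the size it prints (1350) is smaller than the actual number of elements (1452).
I copied the page to my computer and deleted some of the <tr> elements. Then I ran the same code and the count was correct. It looks like there are too many elements, so jsoup cannot read all of them?
What is going on? Thanks!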
Answer 0 (score: 0)
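Original answer: The problem is in Jsoup's internal HTTP connection handling. There is nothing wrong with the selector engine. I did not dig deep, but proprietary ways of handling HTTP connections always have problems. I suggest replacing it with HttpClient - http://hc.apache.org/. If you cannot add the HTTP client as a dependency, you may need to check the Jsoup source code where it handles the HTTP connection.
Update: The actual problem is the default maxBodySize of Jsoup.Connection, which cuts off the response body once the limit is reached. See the updated code, which raises the limit. *I have still kept the HttpClient code as an example.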
Program output:
load from jsoup connect using maxBodySize= 1452

package test;
import java.io.IOException;
import java.io.InputStream;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class TestJsoup {
    /**
     * Counts the tr_normal rows using four different ways of loading the page.
     *
     * @param args not used
     * @throws IOException if the page or the local copy cannot be read
     */
    public static void main(String[] args) throws IOException {
        // 1. Local copy of the page on the classpath
        Document doc = Jsoup.parse(loadContentFromClasspath(), "UTF-8", "");
        Elements es = doc.getElementsByClass("tr_normal");
        System.out.println("load from file= " + es.size());
        // 2. Body fetched with Apache HttpClient, parsed by Jsoup
        doc = Jsoup.parse(loadContentByHttpClient(), "UTF-8", "");
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from http client= " + es.size());
        // 3. Jsoup's own connection with the default body size limit
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        doc = Jsoup.connect(url).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect= " + es.size());
        // 4. Jsoup's own connection with a raised body size limit
        int maxBodySize = 2048000; // 2MB (default is 1MB), 0 for unlimited size
        doc = Jsoup.connect(url).maxBodySize(maxBodySize).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect using maxBodySize= " + es.size());
    }

    public static InputStream loadContentByHttpClient()
            throws ClientProtocolException, IOException {
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        // The client is not closed here because the caller still has to read the returned stream.
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet(url);
        HttpResponse response = client.execute(request);
        return response.getEntity().getContent();
    }

    public static InputStream loadContentFromClasspath()
            throws ClientProtocolException, IOException {
        // Returns null if eisdeqty_pf.htm is not on the classpath.
        return TestJsoup.class.getClassLoader().getResourceAsStream(
                "eisdeqty_pf.htm");
    }
}
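
To see why a body size limit leads to a smaller element count, here is a minimal sketch that uses only the JDK, not Jsoup. It is an illustration under assumptions, not how Jsoup is implemented: the class BodySizeCapDemo and its methods readCapped, countRows, and buildTable are hypothetical names, the generated table stands in for the real page, and rows are counted with a plain substring search instead of an HTML parser. The "0 means unlimited" convention mirrors the maxBodySize comment in the code above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BodySizeCapDemo {

    /** Reads at most maxBytes from the stream; 0 means "no limit". */
    public static byte[] readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int remaining = maxBytes;
        while (true) {
            int want = (maxBytes == 0) ? buffer.length : Math.min(buffer.length, remaining);
            if (want == 0) {
                break; // cap reached: the rest of the body is silently dropped
            }
            int read = in.read(buffer, 0, want);
            if (read == -1) {
                break; // end of stream
            }
            out.write(buffer, 0, read);
            remaining -= read;
        }
        return out.toByteArray();
    }

    /** Builds a stand-in page with the given number of tr_normal rows. */
    public static String buildTable(int rows) {
        StringBuilder sb = new StringBuilder("<table>");
        for (int i = 0; i < rows; i++) {
            sb.append("<tr class=\"tr_normal\"><td>").append(i).append("</td></tr>");
        }
        return sb.append("</table>").toString();
    }

    /** Counts tr_normal rows that survive a read capped at maxBytes. */
    public static int countRows(String html, int maxBytes) {
        String needle = "class=\"tr_normal\"";
        try {
            byte[] raw = html.getBytes(StandardCharsets.UTF_8);
            byte[] body = readCapped(new ByteArrayInputStream(raw), maxBytes);
            String text = new String(body, StandardCharsets.UTF_8);
            int count = 0;
            int from = 0;
            while ((from = text.indexOf(needle, from)) != -1) {
                count++;
                from += needle.length();
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = buildTable(1452);
        System.out.println("unlimited= " + countRows(html, 0));
        System.out.println("capped at half the body= " + countRows(html, html.length() / 2));
    }
}
```

With the cap set below the body length, the rows past the cut-off never reach the counting step, which matches the symptom in the question: the selector works, but it only sees part of the page.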