Jsoup does not download the whole page

Time: 2014-05-04 15:23:11

Tags: java html http web jsoup

The page is: http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm

I want to use Jsoup to extract all <tr class="tr_normal"> elements.

The code I use is:

    Document doc = Jsoup.connect(url).get();
    Elements es = doc.getElementsByClass("tr_normal");
    System.out.println(es.size());

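But the size it prints (1350) is smaller than the actual number of rows (1452). I saved the page to my computer and deleted some of the <tr> elements. When I ran the same code against that copy, the count was correct. It looks as if there are too many elements, so Jsoup cannot read all of them?

What is going on? Thanks!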

1 answer:

Answer 0 (score: 0)

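The selector engine is not at fault.

My first guess was Jsoup's internal HTTP connection handling. I did not dig deep, but home-grown HTTP handling is a common source of trouble, so I suggested replacing it with HttpClient (http://hc.apache.org/) or, if you cannot add HttpClient as a dependency, checking how the Jsoup source handles the connection.

Update: the actual cause is the default maxBodySize of Jsoup's Connection. The response body is cut off once it reaches that limit (1 MB by default in the version used here), so the rows past the cut-off never reach the parser. Raising the limit, or passing 0 for no limit, returns all 1452 rows. I have kept the HttpClient code as an example.

Program output: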

  • load from file= 1452
  • load from http client= 1452
  • load from jsoup connect= 1350
  • load from jsoup connect using maxBodySize= 1452

    package test;
    
    import java.io.IOException;
    import java.io.InputStream;
    
    import org.apache.http.HttpResponse;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.HttpClientBuilder;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    
    public class TestJsoup {
    
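        /**
         * Counts the tr_normal rows four ways: from a local copy of the page,
         * via HttpClient, via Jsoup with its default body limit, and via
         * Jsoup with a raised body limit.
         */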
        public static void main(String[] args) throws IOException {
            Document doc = Jsoup.parse(loadContentFromClasspath(), "UTF8", "");
            Elements es = doc.getElementsByClass("tr_normal");
            System.out.println("load from file= " + es.size());
    
            doc = Jsoup.parse(loadContentByHttpClient(), "UTF8", "");
            es = doc.getElementsByClass("tr_normal");
            System.out.println("load from http client= " + es.size());
    
            String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                    + "/stockcode/eisdeqty_pf.htm";
            doc = Jsoup.connect(url).get();
            es = doc.getElementsByClass("tr_normal");
            System.out.println("load from jsoup connect= " + es.size());
    
            // About 2 MB (the default is 1 MB); pass 0 for no limit.
            int maxBodySize = 2048000;
            doc = Jsoup.connect(url).maxBodySize(maxBodySize).get();
            es = doc.getElementsByClass("tr_normal");
            System.out.println("load from jsoup connect using maxBodySize= " + es.size());
        }
    
        public static InputStream loadContentByHttpClient()
                throws ClientProtocolException, IOException {
            String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                    + "/stockcode/eisdeqty_pf.htm";
            HttpClient client = HttpClientBuilder.create().build();
            HttpGet request = new HttpGet(url);
            HttpResponse response = client.execute(request);
            return response.getEntity().getContent();
        }
    
        // Returns null if eisdeqty_pf.htm is not on the classpath.
        public static InputStream loadContentFromClasspath() {
            return TestJsoup.class.getClassLoader().getResourceAsStream(
                    "eisdeqty_pf.htm");
        }
    
    }
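
To see why a size limit produces a smaller count instead of an error, here is a minimal sketch that does not use Jsoup at all. It builds a synthetic table in memory, reads it through a byte cap, and counts the rows that survive. The class `BodyCapDemo` and its helpers `readCapped`, `buildPage`, `count`, and `rowsSeen` are hypothetical names made up for this illustration; this is an assumption about the general mechanism, not Jsoup's actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BodyCapDemo {

    /** Reads at most maxBytes from the stream; maxBytes == 0 means no limit. */
    public static byte[] readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int remaining = maxBytes;
        while (true) {
            int want = (maxBytes == 0) ? buf.length : Math.min(buf.length, remaining);
            if (want == 0) {
                break; // cap reached: the rest of the body is silently dropped
            }
            int n = in.read(buf, 0, want);
            if (n == -1) {
                break; // end of stream
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    /** Counts non-overlapping occurrences of needle in text. */
    public static int count(String text, String needle) {
        int found = 0;
        int i = text.indexOf(needle);
        while (i != -1) {
            found++;
            i = text.indexOf(needle, i + needle.length());
        }
        return found;
    }

    /** Builds a synthetic ASCII-only table with the given number of rows. */
    public static String buildPage(int rows) {
        StringBuilder sb = new StringBuilder("<table>");
        for (int i = 0; i < rows; i++) {
            sb.append("<tr class=\"tr_normal\"><td>").append(i).append("</td></tr>");
        }
        return sb.append("</table>").toString();
    }

    /** Counts the rows still present after reading the page through the cap. */
    public static int rowsSeen(String page, int maxBytes) {
        try {
            byte[] raw = page.getBytes(StandardCharsets.UTF_8);
            byte[] body = readCapped(new ByteArrayInputStream(raw), maxBytes);
            return count(new String(body, StandardCharsets.UTF_8), "class=\"tr_normal\"");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = buildPage(1452);
        System.out.println("no cap   = " + rowsSeen(page, 0));
        // The page is ASCII, so its length in chars equals its length in bytes.
        System.out.println("with cap = " + rowsSeen(page, page.length() / 2));
    }
}
```

The uncapped read sees all 1452 rows, while the capped read sees fewer, and nothing signals that the input was cut short. That matches the symptom in the question: a plausible but too-small count.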