Parsing an HTML page: differences in page content between Java code and the browser

Date: 2018-02-24 19:15:39

Tags: java html

URL: https://www.bing.com/search?q=vevo+USIV30300367

If I View Source on the URL above (in Internet Explorer 11), the substring for the first search result is:

"[h2][a href="https://www.vevo.com/watch/rush/tom-sawyer-(live-exit-stage-left-version)/USIV30300367" h="ID=SERP,5075.1"]Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - Rush - [strong]Vevo[/strong][/a][/h2]"

From my Java code, I get this instead:

"[h2][a href="https://www.vevo.com/watch/rush/tom-sawyer-(live-exit-stage-left-version)/USIV30300367" h="ID=SERP,5077.1"][span dir="ltr"]Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - …[/span][/a][/h2]"

The markup differs slightly (note the [span] tag), but worse, the video title is truncated in the search-result string (that is, "Rush - Vevo" becomes "…").
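To pinpoint where the two link texts diverge, a small comparison helper can be used. This is only a minimal sketch: the class name `FirstDifference` and the method `firstDifference` are hypothetical, and the two sample strings are the link texts quoted above, in the same "[" and "]" notation.

```java
class FirstDifference {

    // Returns the index of the first character at which a and b differ,
    // or -1 when the two strings are identical.
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        // One string is a prefix of the other (or they are equal).
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        // Link text as seen in the browser's View Source.
        String fromBrowser = "Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - Rush - [strong]Vevo[/strong]";
        // Link text as returned to the Java code (\u2026 is the ellipsis character).
        String fromJava = "Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - \u2026";

        int i = firstDifference(fromBrowser, fromJava);
        if (i >= 0) {
            System.out.println("First difference at index " + i);
            System.out.println("Browser: " + fromBrowser.substring(i));
            System.out.println("Java:    " + fromJava.substring(i));
        } else {
            System.out.println("The strings are identical");
        }
    }
}
```

This only locates the divergence. It does not explain why the server returns different markup.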

Why does this happen, and how can I fix it?

(Note: in this post I use "[" and "]" in place of the original HTML tag delimiters, so that my strings are not rendered as markup on SO.)

Here is my Java code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

private String getWebPage(String pageURL, UserAgentBrowser uab)
{
    StringBuilder pagedata = new StringBuilder();
    String charset = "utf-8"; // fallback when the server names no charset

    try {
        URL url = new URL(pageURL);
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("User-Agent", uab.toString());

        // getContentType() returns null when the header is absent
        String contenttype = conn.getContentType();
        if (contenttype != null) {
            int indexL = contenttype.indexOf("charset=") + 8;
            if (indexL > 7) {
                int indexR = contenttype.indexOf(";", indexL);
                charset = (indexR == -1 ? contenttype.substring(indexL) : contenttype.substring(indexL, indexR)).trim();
            }
        }

        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() strips the line terminator, so put one back
                pagedata.append(line).append('\n');
            }
        }
    } catch (IOException ioe) { // also covers MalformedURLException
        ioe.printStackTrace();
    }

    return (pagedata.length() == 0 ? null : pagedata.toString());
}

String pagedata = getWebPage("https://www.bing.com/search?q=vevo+USIV30300367", UserAgentBrowser.INTERNET_EXPLORER);

UserAgentBrowser.INTERNET_EXPLORER.toString() equals:

"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
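The charset lookup inside getWebPage can be isolated and tested without a network connection. The following is a minimal sketch, not the code from the question: the class name `ContentTypeCharset` and the method `charsetOf` are hypothetical. It assumes the header may be null, may name no charset, may spell the parameter in any letter case, and may quote the value.

```java
import java.util.Locale;

class ContentTypeCharset {

    // Returns the charset named in a Content-Type header value,
    // or the fallback when the header is null or names no charset.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters are separated by ';', e.g. "text/html; charset=utf-8".
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding quotes, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", "utf-8"));
        System.out.println(charsetOf("text/html", "utf-8"));
        System.out.println(charsetOf(null, "utf-8"));
    }
}
```

Passing the fallback explicitly keeps the default ("utf-8" in the question) visible at the call site.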

0 answers:

No answers yet.