If I "View source" on the URL https://www.bing.com/search?q=vevo+USIV30300367 (in Internet Explorer 11), the substring for the first search result is:
"[h2][a href="https://www.vevo.com/watch/rush/tom-sawyer-(live-exit-stage-left-version)/USIV30300367" h="ID=SERP,5075.1"]Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - Rush - [strong]Vevo[/strong][/a][/h2]"
When I fetch the same URL from Java code, I get this instead:
"[h2][a href="https://www.vevo.com/watch/rush/tom-sawyer-(live-exit-stage-left-version)/USIV30300367" h="ID=SERP,5077.1"][span dir="ltr"]Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - …[/span][/a][/h2]"
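The markup is slightly different (note the [span] tag), but worse, the video title is truncated in the search result string: "Rush - Vevo" becomes "…".
Why does this happen, and how can I fix it?
(Note: in this post I use "[" and "]" in place of the original HTML tag delimiters, so that my strings are not rendered as markup on SO.)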
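To make the difference easy to check programmatically, the comparison can be sketched roughly as follows. This is only an illustration and not part of my fetching code: `TitleCheck`, `linkText` and `isTruncated` are hypothetical names, and the sketch assumes the real angle brackets rather than the "[" and "]" substitutes used in this post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleCheck {

    // Hypothetical helper: returns the visible text of the first <a> element
    // whose href equals the given URL, or null if no such link is found.
    static String linkText(String html, String href) {
        Pattern p = Pattern.compile(
                "<a\\s+href=\"" + Pattern.quote(href) + "\"[^>]*>(.*?)</a>",
                Pattern.DOTALL);
        Matcher m = p.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Drop nested tags such as <strong> and <span dir="ltr">.
        return m.group(1).replaceAll("<[^>]+>", "");
    }

    // True if the text contains the horizontal ellipsis character (U+2026).
    static boolean isTruncated(String text) {
        return text != null && text.indexOf('\u2026') >= 0;
    }

    public static void main(String[] args) {
        String href = "https://www.vevo.com/watch/rush/"
                + "tom-sawyer-(live-exit-stage-left-version)/USIV30300367";
        // The result as returned to the Java code, with real angle brackets.
        String fromJava = "<h2><a href=\"" + href + "\" h=\"ID=SERP,5077.1\">"
                + "<span dir=\"ltr\">Tom <strong>Sawyer (Live Exit Stage Left Version"
                + "</strong>) - \u2026</span></a></h2>";
        String text = linkText(fromJava, href);
        System.out.println(text + " | truncated=" + isTruncated(text));
    }
}
```

Running the same check against the markup shown by "View source" should report no truncation, which makes it simple to tell when a change to the request has had an effect.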
Here is my Java code for fetching the page:
    private String getWebPage(String pageURL, UserAgentBrowser uab)
    {
        URL url = null;
        InputStream is = null;
        BufferedReader br = null;
        URLConnection conn = null;
        StringBuilder pagedata = new StringBuilder();
        String contenttype = null, charset = "utf-8";
        String line = null;
        try {
            url = new URL(pageURL);
            conn = url.openConnection();
            conn.addRequestProperty("User-Agent", uab.toString());
            contenttype = conn.getContentType();
            int indexL = contenttype.indexOf("charset=") + 8;
            if (indexL > 7) {
                int indexR = contenttype.indexOf(";", indexL);
                charset = (indexR == -1 ? contenttype.substring(indexL) : contenttype.substring(indexL, indexR));
            }
            is = conn.getInputStream(); // Could throw an IOException
            br = new BufferedReader(new InputStreamReader(is, charset));
            while (true) {
                line = br.readLine();
                if (line == null) break;
                pagedata.append(line);
            }
        } catch (MalformedURLException mue) {
            // mue.printStackTrace();
        } catch (IOException ioe) {
            // ioe.printStackTrace();
        } finally {
            try {
                if (is != null) is.close();
            } catch (IOException ioe) {
                // Nothing to see here
            }
        }
        return (pagedata.length() == 0 ? null : pagedata.toString());
    }
and I call it like this:
    String pagedata = getWebPage("https://www.bing.com/search?q=vevo+USIV30300367", UserAgentBrowser.INTERNET_EXPLORER);
where UserAgentBrowser.INTERNET_EXPLORER.toString() equals:
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
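Separately, the charset detection inside getWebPage can be pulled out and exercised on its own, which helps rule it out as a cause. The sketch below is a minimal illustration under stated assumptions: `CharsetSniffer` and `charsetFrom` are hypothetical names, and unlike my code above it guards against a null Content-Type and against quoted values, both of which I am assuming can occur.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {

    // Matches charset=value or charset="value", case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: extracts the charset parameter from a
    // Content-Type header value, falling back to "utf-8" when the header
    // is missing or carries no charset parameter.
    static String charsetFrom(String contentType) {
        if (contentType == null) {
            return "utf-8"; // URLConnection.getContentType() may return null
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : "utf-8";
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFrom("text/html"));
        System.out.println(charsetFrom(null));
    }
}
```

In getWebPage this would replace the indexOf/substring arithmetic with a single call such as charsetFrom(conn.getContentType()).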