I want to get the HTML source of https://www2.cslb.ca.gov/OnlineServices/CheckLicenseII/LicenseDetail.aspx?LicNum=872423
I am using the method below, but it does not return the HTML source.
public static String getHTML(URL url) {
    StringBuilder result = new StringBuilder(); // Accumulates all the HTML
    try {
        // The actual connection to the web page
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Used to read results from the web page; closed automatically
        try (BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line; // An individual line of the web page HTML
            while ((line = rd.readLine()) != null) {
                result.append(line).append('\n'); // readLine() strips the line break
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result.toString();
}
Answer 0 (score: 4)
The server filters out Java's default User-Agent. This works:
public static String getHTML(URL url) {
    try {
        final URLConnection urlConnection = url.openConnection();
        // Send a custom agent instead of the default "Java/<version>"
        urlConnection.addRequestProperty("User-Agent", "Foo?");
        // try-with-resources closes the stream even if reading fails
        try (InputStream inputStream = urlConnection.getInputStream()) {
            // IOUtils comes from Apache Commons IO; UTF-8 is an assumption here
            return IOUtils.toString(inputStream, StandardCharsets.UTF_8);
        }
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
It looks like the User-Agent is blacklisted. By default, my JDK sends:
User-Agent: Java/1.6.0_26
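If you want to check what your own JDK sends, one option is to point a request at a throwaway local server that echoes the header back. The following is only a sketch under a few assumptions: it relies on the JDK's bundled com.sun.net.httpserver package and on Java 9+ (for readAllBytes), and the class and method names (UserAgentEcho, echoedUserAgent) are made up for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
class UserAgentEcho {

    // Starts a local server that replies with the request's User-Agent,
    // sends one request to it, and returns the echoed value.
    // Pass null to leave the JDK's default agent untouched.
    static String echoedUserAgent(String userAgent) {
        HttpServer server = null;
        try {
            // Port 0 lets the OS pick any free port.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String ua = exchange.getRequestHeaders().getFirst("User-Agent");
                byte[] body = String.valueOf(ua).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();

            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            URLConnection conn = url.openConnection();
            if (userAgent != null) {
                conn.setRequestProperty("User-Agent", userAgent);
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("default: " + echoedUserAgent(null));
        System.out.println("custom:  " + echoedUserAgent("Foo?"));
    }
}
```

With no override, the echoed value should have the same Java/<version> shape as the header shown above, with your own JDK's version number.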
Note that I used the IOUtils class (from Apache Commons IO) to keep the example short, but the key line is:
urlConnection.addRequestProperty("User-Agent", "Foo?");
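If you would rather not pull in Commons IO, the same idea can be sketched with the JDK alone. This is a minimal version under stated assumptions: the page is decoded as UTF-8 (the real charset may differ), and the HtmlFetcher class, the readAll helper, and the two-argument getHTML are illustrative names, not something from the original answer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical JDK-only variant of the answer's getHTML.
class HtmlFetcher {

    // Reads the whole stream and decodes it with the given charset.
    static String readAll(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fetches the page, sending the given User-Agent instead of the JDK default.
    static String getHTML(URL url, String userAgent) {
        try {
            URLConnection conn = url.openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            try (InputStream in = conn.getInputStream()) {
                // Assumption: the response body is UTF-8.
                return readAll(in, StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www2.cslb.ca.gov/OnlineServices/CheckLicenseII/LicenseDetail.aspx?LicNum=872423");
        System.out.println(getHTML(url, "Foo?"));
    }
}
```

Here setRequestProperty is used rather than addRequestProperty so that any earlier value for the header is overwritten; either call should be enough to replace the default agent on a fresh connection.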