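I am using Java JRE 1.8.0_141 and trying to fetch a specific URL and store its HTML in a String so that I can work with the data later in my code, but whenever I call getInputStream I get an HTTP 405 error.
The code works without problems for other URLs. The troublesome URL is the one that appears in the error below.
This is the exact error from Eclipse 4.6.3: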
<terminated, exit value: 1>C:\Program Files\Java\jre1.8.0_141\bin\javaw.exe (Aug 6, 2017, 10:53:37 PM)
Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Server returned HTTP response code: 405 for URL: http://www.streeteasy.com/for-rent/nyc/status:open%7Cprice:1750-2900%7Carea:104,116,119,143,141%7Camenities:pool?page=2&refined_search=true
at RunMe.getHTMLFromURL(RunMe.java:52)
at RunMe.main(RunMe.java:18)
Caused by: java.io.IOException: Server returned HTTP response code: 405 for URL: http://www.streeteasy.com/for-rent/nyc/status:open%7Cprice:1750-2900%7Carea:104,116,119,143,141%7Camenities:pool?page=2&refined_search=true
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at RunMe.getHTMLFromURL(RunMe.java:36)
... 1 more
My RunMe.java code is as follows:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.LinkedList;

public class RunMe {

    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        System.out.println(getHTMLFromURL("http://www.streeteasy.com/for-rent/nyc/status:open%7Cprice:1750-2900%7Carea:104,116,119,143,141%7Camenities:pool?page=2&refined_search=true"));
    }

    public static String getHTMLFromURL(String url){
        try{
            URL urlObj = new URL(url);
            URLConnection con = urlObj.openConnection();
            con.setDoOutput(false);
            con.connect();
            BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
            // CODE FAILS HERE ^
            StringBuilder response = new StringBuilder();
            String inputLine;
            String newLine = System.getProperty("line.separator");
            while ((inputLine = in.readLine()) != null){
                response.append(inputLine + newLine);
            }
            in.close();
            return response.toString();
        }
        catch (Exception e){
            throw new RuntimeException(e);
        }
    }
}
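Does anyone know how I can extract the HTML from this URL, if not with this method? Thanks in advance!
Answer 0 (score: -1)
I ran a curl command against the URL, and it looks like the site is trying to run JavaScript to render the page.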
curl -v -L -H "User-Agent: Mozilla/5.0" -H "Accept: text/html" "http://www.streeteasy.com/for-rent/nyc/status:open%7Cprice:1750-2900%7Carea:104,116,119,143,141%7Camenities:pool?page=2"
> GET /for-rent/nyc/status:open%7Cprice:1750-2900%7Carea:104,116,119,143,141%7Camenities:pool?page=2 HTTP/1.1
> Host: www.streeteasy.com
> User-Agent: Mozilla/5.0
> Accept: text/html
>
< HTTP/1.1 405 Not Allowed
// elided
<h1>Pardon Our Interruption...</h1>
<p>As you were browsing <strong>www.streeteasy.com</strong> something about your browser made us think you were a bot. There are a few reasons this might happen:</p>
<ul>
<li>You're a power user moving through this website with super-human speed.</li>
<li>You've disabled JavaScript in your web browser.</li>
<li>A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this <a title='Third party browser plugins that block javascript' href='http://ds.tl/help-third-party-plugins' target='_blank'>support article</a>.</li>
</ul>
<p>After completing the CAPTCHA below, you will immediately regain access to www.streeteasy.com.</p>
Unless you can fill in the CAPTCHA programmatically, you are probably out of luck.
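You can see the same block page from Java instead of curl. getInputStream() throws an IOException for 4xx/5xx responses, but HttpURLConnection still exposes the body through getErrorStream(). Below is a minimal sketch of that, not a fix for the 405 itself. The class and method names (ErrorBodySketch, readAll, fetchBody) are made up for illustration, and UTF-8 is only an assumption about the response encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question's code.
public class ErrorBodySketch {

    // Reads a whole stream into a String, ending every line with '\n'.
    public static String readAll(InputStream stream) throws IOException {
        if (stream == null) {
            return ""; // getErrorStream() returns null when the server sent no body
        }
        StringBuilder body = new StringBuilder();
        // Assumption: the body is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    // Returns the status line plus the body, whether or not the request succeeded.
    public static String fetchBody(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        // Same headers as in the curl command above.
        con.setRequestProperty("User-Agent", "Mozilla/5.0");
        con.setRequestProperty("Accept", "text/html");
        int status = con.getResponseCode();
        // For 4xx/5xx the body is only available on the error stream.
        InputStream stream = status >= 400 ? con.getErrorStream() : con.getInputStream();
        return "HTTP " + status + "\n" + readAll(stream);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ErrorBodySketch <url>");
            return;
        }
        System.out.println(fetchBody(args[0]));
    }
}
```

Run it with the URL from the question as the only argument. If the site still answers 405, the printed body should be the "Pardon Our Interruption" page rather than a stack trace.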
Edit:
The problem is apparently cookies, as shown in the discussion below.
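If cookies are indeed the issue, one thing to try is letting HttpURLConnection keep them between requests by installing a java.net.CookieManager as the default CookieHandler. This is only a sketch. It assumes the site sets a cookie on one response and expects it back on the next, which nothing here confirms, and it may well not get past the bot check. The names CookieSketch, installCookieManager and status are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class, not part of the original question's code.
public class CookieSketch {

    // Installs a JVM-wide cookie handler so that HttpURLConnection stores
    // cookies from responses and sends them back on later requests.
    public static CookieManager installCookieManager() {
        // null store = default in-memory store; ACCEPT_ALL keeps every cookie.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    // Issues a GET and returns only the status code; the body is closed unread.
    public static int status(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestProperty("User-Agent", "Mozilla/5.0");
        int code = con.getResponseCode();
        InputStream body = code >= 400 ? con.getErrorStream() : con.getInputStream();
        if (body != null) {
            body.close();
        }
        return code;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java CookieSketch <url>");
            return;
        }
        CookieManager manager = installCookieManager();
        // The first request may only collect cookies; the second sends them back.
        System.out.println("first request:  HTTP " + status(args[0]));
        System.out.println("cookies stored: " + manager.getCookieStore().getCookies().size());
        System.out.println("second request: HTTP " + status(args[0]));
    }
}
```

Call installCookieManager() once, before the first openConnection(). Every later HttpURLConnection in the same JVM then shares the cookie store, so getHTMLFromURL from the question would not need any other change to send cookies.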