I am trying to fetch the content of some URLs with my Java code. The code returns content for some URLs, for example: "http://www.nytimes.com/video/world/europe/100000004503705/memorials-for-victims-of-istanbul-attack.html", but it returns nothing for others, such as this one: "http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0". When I check the URL manually I can see the content, and even when I look at the page source I don't notice any particular difference in structure between the pages. Yet I still get nothing back for this URL.
Is this related to some permission issue, to the page structure, or to my Java code?
Here is my code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class TestJsoup {

    public static void main(String[] args) {
        System.out.println(getUrlParagraphs("http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0"));
    }

    public static String getUrlParagraphs(String url) {
        try {
            URL urlContent = new URL(url);
            BufferedReader in = new BufferedReader(new InputStreamReader(urlContent.openStream()));
            String line;
            StringBuffer html = new StringBuffer();
            while ((line = in.readLine()) != null) {
                html.append(line);
                System.out.println("Test");
            }
            in.close();
            System.out.println(html.toString());
            return html.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
Answer 0 (score: 0)
This happens because the second URL redirects, and your code makes no attempt to follow the redirect.
Try it with curl -v:
$ curl -v 'http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0'
* Hostname was NOT found in DNS cache
* Trying 170.149.161.130...
* Connected to www.nytimes.com (170.149.161.130) port 80 (#0)
> GET /2016/07/24/travel/mozart-vienna.html?_r=0 HTTP/1.1
> User-Agent: curl/7.35.0
> Host: www.nytimes.com
> Accept: */*
>
< HTTP/1.1 303 See Other
* Server Varnish is not blacklisted
< Server: Varnish
< Location: http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com%2F2016%2F07%2F24%2Ftravel%2Fmozart-vienna.html%3F_r%3D1
< Accept-Ranges: bytes
< Date: Thu, 04 Aug 2016 08:45:53 GMT
< Age: 0
< X-API-Version: 5-0
< X-PageType: article
< Connection: close
< X-Frame-Options: DENY
< Set-Cookie: RMID=007f0101714857a300c1000d;Path=/; Domain=.nytimes.com;Expires=Fri, 04 Aug 2017 08:45:53 UTC
<
* Closing connection 0
You can see there is no content: the response is a 3xx status code, and there is a Location: header.
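One way to deal with this is to follow the redirect explicitly and keep any cookies the server sets along the way (the nytimes.com redirect above points at a glogin endpoint that sets a cookie). The sketch below is an assumption about one possible fix, not the asker's or answerer's code; it uses only standard JDK classes (HttpURLConnection, CookieManager), and the class and method names (RedirectFetcher, fetchWithRedirects) are made up for illustration.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectFetcher {

    public static void main(String[] args) throws IOException {
        System.out.println(fetchWithRedirects(
                "http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0"));
    }

    // Follows up to a few redirects manually, carrying cookies between requests.
    public static String fetchWithRedirects(String url) throws IOException {
        // Install a cookie store so the Set-Cookie from the 303 response is sent back.
        CookieHandler.setDefault(new CookieManager());

        String current = url;
        for (int hop = 0; hop < 5; hop++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection();
            conn.setInstanceFollowRedirects(false);               // we follow redirects ourselves
            conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // some servers reject the default agent

            int status = conn.getResponseCode();
            if (status >= 300 && status < 400) {
                // Redirect: resolve the Location header (may be relative) and try again.
                current = new URL(new URL(current), conn.getHeaderField("Location")).toString();
                conn.disconnect();
                continue;
            }

            // Final response: read the body.
            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
            return html.toString();
        }
        return null; // gave up: too many redirects
    }
}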
Answer 1 (score: 0)
Hello. The problem is in your URL. I tried your code on my machine and it also returned null, but I read the Oracle docs about this and found that the problem is the host, so if you change the URL (for example to this Stack Overflow link) it works fine. My code is here:
package sd.nctr.majid;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class Program {

    public static void main(String[] args) {
        System.out.println(getUrlParagraphs("http://stackoverflow.com/questions/4328711/read-url-to-string-in-few-lines-of-java-code"));
    }

    public static String getUrlParagraphs(String url) {
        try {
            URL urlContent = new URL(url);
            BufferedReader in = new BufferedReader(new InputStreamReader(urlContent.openStream()));
            String line;
            StringBuffer html = new StringBuffer();
            while ((line = in.readLine()) != null) {
                html.append(line);
                System.out.println("Test");
            }
            in.close();
            System.out.println(html.toString());
            return html.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
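Note that this only sidesteps the problem by picking a URL that does not redirect. Since the question's class was named TestJsoup, another option worth mentioning is the Jsoup library (assuming it is on the classpath): Jsoup follows redirects by default and lets you pull out just the paragraph text. This is a sketch, not part of either answer:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {

    public static void main(String[] args) throws Exception {
        // Jsoup follows redirects by default when executing the request.
        Document doc = Jsoup.connect("http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0")
                .userAgent("Mozilla/5.0") // present a browser-like User-Agent
                .timeout(10000)           // 10-second timeout
                .get();

        // Print the text of every <p> element on the page.
        for (Element p : doc.select("p")) {
            System.out.println(p.text());
        }
    }
}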