Why can my Java code get the content of only some URLs (web pages)?

Time: 2016-08-04 08:42:29

Tags: java html url bufferedreader

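I am trying to fetch the content of some URLs with my Java code. The code returns the content for some URLs, for example "http://www.nytimes.com/video/world/europe/100000004503705/memorials-for-victims-of-istanbul-attack.html", but it returns nothing for others, for example "http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0". When I open the URL manually I can see the content, and even when I view the page source I don't notice any particular difference in the structure of the two pages. Still, I get nothing back for this URL.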

Is this related to a permission issue, to the structure of the web page, or to my Java code?

Here is my code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class TestJsoup {
    public static void main(String[] args) {
        System.out.println(getUrlParagraphs("http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0"));
    }

    public static String getUrlParagraphs(String url) {
        try {
            URL urlContent = new URL(url);
            BufferedReader in = new BufferedReader(new InputStreamReader(urlContent.openStream()));
            String line;
            StringBuffer html = new StringBuffer();
            while ((line = in.readLine()) != null) {
                html.append(line);
                System.out.println("Test");
            }
            in.close();
            System.out.println(html.toString());
            return html.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}

2 answers:

Answer 0: (score: 0)

This is because the second URL redirects, and you do not attempt to follow the redirect.

Try accessing it with curl -v:

$ curl -v 'http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0'
* Hostname was NOT found in DNS cache
*   Trying 170.149.161.130...
* Connected to www.nytimes.com (170.149.161.130) port 80 (#0)
> GET /2016/07/24/travel/mozart-vienna.html?_r=0 HTTP/1.1
> User-Agent: curl/7.35.0
> Host: www.nytimes.com
> Accept: */*
> 
< HTTP/1.1 303 See Other
* Server Varnish is not blacklisted
< Server: Varnish
< Location: http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com%2F2016%2F07%2F24%2Ftravel%2Fmozart-vienna.html%3F_r%3D1
< Accept-Ranges: bytes
< Date: Thu, 04 Aug 2016 08:45:53 GMT
< Age: 0
< X-API-Version: 5-0
< X-PageType: article
< Connection: close
< X-Frame-Options: DENY
< Set-Cookie: RMID=007f0101714857a300c1000d;Path=/; Domain=.nytimes.com;Expires=Fri, 04 Aug 2017 08:45:53 UTC
< 
* Closing connection 0

You can see that the response has no content: it is a 3XX return code, and there is a Location: header.
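
The same check can be made from Java. The following is a minimal sketch, not a definitive fix: it turns off automatic redirect handling so that each hop is visible, reads the status code and the Location: header, and requests the next URL itself. The class name RedirectFetcher, the helpers isRedirect and resolve, and the MAX_REDIRECTS limit are hypothetical names chosen for illustration. Installing a CookieManager rests on an assumption: the Set-Cookie header in the curl output suggests the login redirect expects cookies to be sent back, but this has not been verified against the site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RedirectFetcher {
    // Hypothetical upper bound on hops, chosen for illustration.
    private static final int MAX_REDIRECTS = 10;

    /** Returns true for the 3xx status codes that point to another URL via Location. */
    public static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Resolves a possibly relative Location value against the URL that returned it. */
    public static URL resolve(URL base, String location) throws IOException {
        return new URL(base, location);
    }

    public static String fetch(String address) throws IOException {
        // Assumption: the server expects its cookies back on later hops,
        // so keep them in an in-memory store for the whole chain.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

        URL url = new URL(address);
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Handle redirects by hand so every hop can be logged.
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();

            if (isRedirect(status)) {
                String location = conn.getHeaderField("Location");
                conn.disconnect();
                if (location == null) {
                    throw new IOException("Redirect " + status + " without a Location header");
                }
                System.err.println(status + " -> " + location);
                url = resolve(url, location);
                continue;
            }
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected status " + status + " for " + url);
            }
            // Assumes the page is UTF-8; a fuller version would read the Content-Type charset.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder html = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    html.append(line).append('\n');
                }
                return html.toString();
            }
        }
        throw new IOException("More than " + MAX_REDIRECTS + " redirects, giving up");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch(args[0]));
    }
}
```

Running it with the article URL as the only argument prints each redirect hop to standard error, which makes it easy to see whether the chain ends at the article or keeps bouncing through the login page.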

Answer 1: (score: 0)

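Hello. The problem is in your URL. I tried your code on my machine and it also returned null, but I read the Oracle documentation about it and found that the problem is the host, so if you change the URL (for example, to the link of this post) it works fine. My code is here: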

package sd.nctr.majid;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class Program {

    public static void main(String[] args) {
        System.out.println(getUrlParagraphs("http://stackoverflow.com/questions/4328711/read-url-to-string-in-few-lines-of-java-code"));

    }

    public static String getUrlParagraphs (String url) {
        try {
          URL urlContent = new URL(url);
          BufferedReader in = new BufferedReader(new InputStreamReader(urlContent.openStream()));
          String line;
          StringBuffer html = new StringBuffer();
          while ((line = in.readLine()) != null) {
            html.append(line);
            System.out.println("Test");
          }
          in.close();
          System.out.println(html.toString());
          return html.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}