Question

Sorry for having to put the URL in the title, but I don't know how else to describe it.

Anyway... I have a file containing the following URLs:

https://rateyourmusic.com/film/%E7%A0%82%E3%81%AE%E5%A5%B3
https://rateyourmusic.com/film/%E7%94%9F%E3%81%8D%E3%82%8B
https://rateyourmusic.com/film/%E4%B9%B1
https://rateyourmusic.com/film/%E7%BE%85%E7%94%9F%E9%96%80

I want to write a Java program that uses Jsoup to open these URLs and collect some information. Here is the program:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RymUrlTest {
    public static void main(String args[]){
        try {
            BufferedReader br = new BufferedReader((new FileReader("list.txt")));

            String line="";
            while ((line = br.readLine()) != null) {
                Document d = Jsoup.connect(line).timeout(0).userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36").get();
            }
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

However, I get the following error:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://rateyourmusic.com/film/ç ?ã?®å¥³
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:435)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:446)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
    at RymUrlTest.main(RymUrlTest.java:15)

Does anyone know how to make Jsoup handle these URLs correctly?

Even when I try URLEncoder.encode, I still get the error.

Answer 1

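The URL you are hitting does a 302 redirect to another URL, and that second URL is the one giving you the error. Here are the raw response headers for the first URL in the question's list: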

HTTP/1.1 302 Found
Server: nginx
Date: Thu, 05 Dec 2013 05:15:14 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 317
Location: http://rateyourmusic.com/film/ç ã®å¥³
Mime-Version: 1.0
X-Firefox-Spdy: 2

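Make sure you have configured Jsoup to follow redirects and to handle URLs in the UTF-8 character set.

Also try opening the URL in Firefox and capturing the request headers it sends. Use those request headers in your own code.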
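The garbled Location value above looks like UTF-8 bytes that were read as ISO-8859-1. Assuming that is what happened, a minimal sketch of undoing it with the standard library could look like this. The class and method names are hypothetical, and the sketch only covers the string repair; one way to get at the header would be to issue the request with followRedirects(false) and read the Location header from the response yourself.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class LocationHeaderRepair {
    // Assumption: the header held UTF-8 bytes that were decoded as ISO-8859-1.
    // ISO-8859-1 maps every byte to one char, so the original bytes can be recovered.
    static String repair(String mangled) {
        byte[] raw = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Repairs the value, then percent-encodes the non-ASCII characters.
    static String toAsciiUrl(String mangledLocation) {
        return URI.create(repair(mangledLocation)).toASCIIString();
    }

    public static void main(String[] args) {
        // Simulate the mis-decoding for the third URL in the question (\u4e71 is the film title).
        String original = "http://rateyourmusic.com/film/\u4e71";
        String mangled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(toAsciiUrl(mangled)); // prints http://rateyourmusic.com/film/%E4%B9%B1
    }
}
```

If the header was mangled in some other way, this round trip will not recover the original URL.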

Answer 2

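First verify that line contains the value you expect and has no trailing newline or carriage return. The URLs at the top of the question use https, but the log shows http.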

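Try converting the %E4-style percent escapes to Unicode. Instead of URLEncoder.encode, use URLDecoder.decode (URLEncoder has no decode method) with "UTF-8" as the encoding to turn the line into an ordinary String. Then pass that String to Jsoup.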
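As a rough illustration of the decoding step, here is a small sketch using only the standard library. The class and method names are made up for the example, and the trim() call is there only to drop a stray trailing newline or carriage return.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class PercentDecodeDemo {
    // Decodes percent escapes such as %E4%B9%B1 as UTF-8.
    // Note: URLDecoder also turns '+' into a space, which does not matter for these URLs.
    static String decodeUtf8(String line) {
        try {
            return URLDecoder.decode(line.trim(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String decoded = decodeUtf8("https://rateyourmusic.com/film/%E4%B9%B1\r");
        // \u4e71 is the single character that the three escaped bytes stand for.
        System.out.println(decoded.equals("https://rateyourmusic.com/film/\u4e71")); // prints true
    }
}
```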

If that doesn't work, try reading the page manually into a String using URL and InputStream, then call Jsoup.parse(string). http://jsoup.org/cookbook/input/parse-document-from-string
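The manual-read fallback could be sketched roughly as follows. The helper below is hypothetical and assumes the page is served as UTF-8, as the Content-Type header in Answer 1 suggests. It only covers turning a stream into a String; in real use the stream would come from something like new URL(address).openStream(), and the result would be handed to Jsoup.parse.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class PageReader {
    // Reads the whole stream and decodes it as UTF-8 (an assumption about the page).
    static String readUtf8(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a network stream so the sketch runs offline.
        byte[] body = "<p>\u4e71</p>".getBytes(StandardCharsets.UTF_8);
        String html = readUtf8(new ByteArrayInputStream(body));
        System.out.println(html.length()); // prints 8
        // With jsoup on the classpath: Document d = Jsoup.parse(html);
    }
}
```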

Answer 3

Alternatively, you can parse the URL before handing it to Jsoup:

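import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public static void main(String args[]){
    // try-with-resources closes the reader even if a request throws
    try (BufferedReader br = new BufferedReader(new FileReader("list.txt"))) {
        final Matcher WHITESPACE_REMOVER = Pattern.compile("\\s+").matcher("");

        String line;
        while ((line = br.readLine()) != null) {
            // Replace whitespace with %20, then percent-encode any non-ASCII characters
            line = WHITESPACE_REMOVER.reset(line).replaceAll("%20");
            String url = URI.create(line).toASCIIString();
            Document d = Jsoup.connect(url).timeout(0).userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36").get();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}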
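To see what the URI step above does on its own, here is a small standalone sketch without Jsoup. The class and method names are hypothetical, and unlike the answer's code it trims the line first so that a trailing carriage return is dropped rather than turned into %20.

```java
import java.net.URI;

final class AsciiUrlDemo {
    // Trims the line, replaces inner whitespace with %20, and percent-encodes non-ASCII characters.
    static String toAsciiUrl(String line) {
        String cleaned = line.trim().replaceAll("\\s+", "%20");
        return URI.create(cleaned).toASCIIString();
    }

    public static void main(String[] args) {
        // A raw Unicode path is escaped as UTF-8 bytes.
        System.out.println(toAsciiUrl("https://rateyourmusic.com/film/\u4e71\r"));
        // prints https://rateyourmusic.com/film/%E4%B9%B1

        // An already escaped URL passes through unchanged.
        System.out.println(toAsciiUrl("https://rateyourmusic.com/film/%E4%B9%B1"));
        // prints https://rateyourmusic.com/film/%E4%B9%B1
    }
}
```

Since the URLs in list.txt are already percent-encoded, this step leaves them as they are. It only changes lines that contain raw Unicode characters or whitespace.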

In Jsoup, how do I connect to and read a page with a URL like "https://rateyourmusic.com/film/%E4%B9%B1"?
