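Jsoup: strange behaviour when encoding umlauts

Date: 2016-06-26 17:49:50

Tags: java url jsoup

I have a problem with the encoding behaviour of the Jsoup library.

I want to parse the content of a web page, and to do so I have to insert people's names, which may contain German umlauts such as ä, ö, and so on.

This is the code I am using: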

doc = Jsoup.parse(new URL(searchURL).openStream(), "UTF-8", searchURL);

to parse the HTML of the respective web page.

But when I look at the document, the ä is displayed like this:

Käse

What am I doing wrong with the encoding?

The web page contains the following header:

<!doctype html>
<html>
    <head lang="en"> 
    <title>Käse - Semantic Scholar</title> 
    <meta charset="utf-8"> 
</html>

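Can anyone help? Thanks :)

Edit: I tried Stephan's answer and it works for the web page www.semanticscholar.org, but I am also parsing another web page, http://www.authormapper.com/

The same code does not work for that page when the author's name contains German umlauts. Does anyone know why this doesn't work? It is rather embarrassing not to know this...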

1 answer:

Answer 0: (score: 3)

This is a known issue with Jsoup. Here are two options for loading the content for Jsoup:

Option 1: JDK only

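// Needs: java.io.*, java.net.HttpURLConnection, java.net.URL,
//        org.jsoup.Jsoup, org.jsoup.nodes.Document
InputStream is = null;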

try {
    // Connect to website
    URL tmp = new URL(searchURL);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200

    int n;
    char[] buffer = new char[4096];
    Reader r = new InputStreamReader(is, "UTF-8"); // decode the bytes explicitly as UTF-8
    Writer w = new StringWriter(); // java.io.StringWriter keeps this option JDK-only
    while (-1 != (n = r.read(buffer))) {
        w.write(buffer, 0, n);
    }

    // Parse html
    String html = w.toString();
    Document doc = Jsoup.parse(html, searchURL);
} catch(IOException e) {
    // Handle exception ...
} finally {
    try {
        if (is != null) {
            is.close();
        }
    } catch (final IOException ioe) {
        // ignore
    }
}

Option 2: using Commons IO

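// Needs: java.io.*, java.net.HttpURLConnection, java.net.URL,
//        org.apache.commons.io.IOUtils, org.jsoup.Jsoup, org.jsoup.nodes.Document
InputStream is = null;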

try {
    // Connect to website
    URL tmp = new URL(searchURL);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200
    String html = IOUtils.toString(is, "UTF-8"); // decode the bytes explicitly as UTF-8

    // Parse html
    Document doc = Jsoup.parse(html, searchURL);
} catch(IOException e) {
    // Handle exception ...
} finally {
    IOUtils.closeQuietly(is);
}
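
Both options decode the response bytes explicitly as UTF-8 before Jsoup sees them. As a rough illustration of why the charset used at that step matters, here is a minimal JDK-only sketch. The class name `DecodeDemo` and the helper `readAll` are made up for this example, and an in-memory byte array stands in for the network stream; it decodes the same UTF-8 bytes once correctly and once as ISO-8859-1.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {

    // Hypothetical helper: reads the whole stream and decodes it with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringWriter out = new StringWriter();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Käse", written with a Unicode escape so the source file encoding does not matter.
        byte[] utf8Bytes = "K\u00e4se".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 4: the two bytes of "ä" become one character
        System.out.println(wrong.length()); // 5: each byte of "ä" becomes its own character
    }
}
```

If a page is actually served in a charset other than the one passed to the reader, the umlauts are garbled in this way before any HTML parsing takes place.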

Final thoughts:

- Never rely on a website's declared encoding unless you have checked manually (when possible) which encoding is really in use.
- Never rely on Jsoup to somehow find the right encoding.
- You can [automate encoding guessing][2]. See the previous link for details.
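
One simple way to check what a server claims, sketched here under assumptions, is to read the `charset` parameter of the HTTP `Content-Type` header and fall back to a default when it is missing or unknown. The class `CharsetSniffer` and the method `charsetFromContentType` are hypothetical names for this example, not part of Jsoup or the JDK, and a header can itself be wrong, so this is only a starting point rather than real encoding detection.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class CharsetSniffer {

    // Hypothetical helper: extracts the charset parameter from a Content-Type value
    // such as "text/html; charset=ISO-8859-1". Returns the fallback when the header
    // is null, has no charset parameter, or names a charset this JVM does not know.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetFromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

With the connection code from the options above, the value returned by `connection.getContentType()` could be passed to this helper, and the resulting charset used in place of the hard-coded "UTF-8" when decoding the stream.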