Question

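I have a question about the encoding behavior of the Jsoup library.

I want to parse the contents of a web page, and to do that I have to insert some people's names, which may contain German umlauts such as ä, ö, etc.

This is the code I am using: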

doc = Jsoup.parse(new URL(searchURL).openStream(), "UTF-8", searchURL);

to parse the HTML of the respective web page.

But when I look into the document, the ä is displayed like this:

KÃ¤se

What am I doing wrong with the encoding?

The web page contains the following header:

<!doctype html>
<html>
    <head lang="en"> 
    <title>KÃ¤se - Semantic Scholar</title> 
    <meta charset="utf-8"> 
</html>

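Can anyone help? Thanks :)

Edit: I tried Stephan's answer, and it works for the web page www.semanticscholar.org, but I am also parsing another web page, http://www.authormapper.com/

The same code does not work for that web page when the author's name contains German umlauts. Does anyone know why this doesn't work? It's quite embarrassing not to know this...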

Answer 1

This is a known issue with Jsoup. Here are two options for loading the content for Jsoup:

Option 1: JDK only

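import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

InputStream is = null;

try {
    // Connect to website (searchURL is the URL string from the question)
    URL tmp = new URL(searchURL);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200

    // Decode the response bytes explicitly as UTF-8 into a String
    int n;
    char[] buffer = new char[4096];
    Reader r = new InputStreamReader(is, "UTF-8");
    Writer w = new StringWriter(); // JDK class (StringBuilderWriter belongs to Commons IO)
    while (-1 != (n = r.read(buffer))) {
        w.write(buffer, 0, n);
    }

    // Parse html (the String is already decoded, so Jsoup does no charset detection)
    String html = w.toString();
    Document doc = Jsoup.parse(html, searchURL);
} catch(IOException e) {
    // Handle exception ...
} finally {
    try {
        if (is != null) {
            is.close();
        }
    } catch (final IOException ioe) {
        // ignore
    }
}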
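
On Java 9 or later, the read loop and the finally block of Option 1 could probably be condensed with try-with-resources. The following is only a minimal sketch and not part of the original answer: the class name `StreamDecoder` and the method `readAll` are hypothetical, and a `ByteArrayInputStream` stands in for `connection.getInputStream()` so the example runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamDecoder {
    // Hypothetical helper: reads the whole stream and decodes it with the given charset.
    // try-with-resources closes the stream even if reading fails.
    public static String readAll(InputStream in, Charset charset) {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), charset); // readAllBytes() needs Java 9+
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for connection.getInputStream(): the UTF-8 bytes of "Käse"
        byte[] body = "K\u00e4se".getBytes(StandardCharsets.UTF_8);
        String html = readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        System.out.println(html.equals("K\u00e4se"));
    }
}
```

The resulting String would then be passed to `Jsoup.parse(html, searchURL)` exactly as in Option 1.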

Option 2: Using Commons IO

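import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

InputStream is = null;

try {
    // Connect to website (searchURL is the URL string from the question)
    URL tmp = new URL(searchURL);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup, decoding the response bytes explicitly as UTF-8
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200
    String html = IOUtils.toString(is, "UTF-8");

    // Parse html
    Document doc = Jsoup.parse(html, searchURL);
} catch(IOException e) {
    // Handle exception ...
} finally {
    IOUtils.closeQuietly(is);
}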
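
Both options hard-code UTF-8, which only helps when the server really sends UTF-8. If a page is served in a different encoding (this may or may not be what happens with the second site mentioned in the question's edit), one possible approach is to read the charset from the HTTP `Content-Type` header and fall back to a default. The sketch below is an assumption-laden illustration, not part of the original answer: the class `CharsetFromHeader`, the method `charsetOf`, and the regular expression are all hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetFromHeader {
    // Matches "charset=..." in a Content-Type value, with optional quotes
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the charset named in the header value, or the
    // fallback if the header is null, has no charset parameter, or names an unknown charset.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) { // illegal or unsupported charset name
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

In the options above, the header value would come from `connection.getContentType()`, and the returned `Charset` could be passed to `new InputStreamReader(is, charset)` in place of the literal "UTF-8". A header can still be wrong or missing, so this does not replace checking the page manually.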

Final thoughts:

- Never rely on the encoding a website declares unless you have manually checked (when possible) the encoding actually in use.
- Never rely on Jsoup to somehow find the right encoding.
- You can [automate encoding guessing][2]. See the previous link for details.
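
For readers wondering where output like the `KÃ¤se` in the question comes from: that pattern is what typically appears when UTF-8 bytes are decoded with a single-byte charset such as ISO-8859-1. The sketch below only illustrates the mechanism and does not establish which step mis-decoded the text in the question's case; the class `MojibakeDemo` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes those bytes with the wrong charset
    public static String garble(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses garble(): recovers the original bytes, then decodes them as UTF-8
    public static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = garble("K\u00e4se");   // "Käse", written with an escape
        System.out.println(garbled.length());   // 5: the two bytes of ä became two chars
        System.out.println(repair(garbled).equals("K\u00e4se"));
    }
}
```

Repairing text this way is a diagnostic trick at best; the more robust fix is to decode the bytes with the correct charset in the first place, as the two options above do.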

Strange behavior of umlaut encoding in Jsoup