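I have a question about the encoding behavior of the Jsoup library.
I want to parse the content of a web page, and to do so I have to insert people's names, which may contain German umlauts such as ä, ö, etc.
This is the code I am using: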
doc = Jsoup.parse(new URL(searchURL).openStream(), "UTF-8", searchURL);
to parse the HTML of the respective web page.
But when I look at the document, the ä is displayed as follows:
Käse
What am I doing wrong with the encoding?
The web page has the following header:
<!doctype html>
<html>
<head lang="en">
<title>Käse - Semantic Scholar</title>
<meta charset="utf-8">
</html>
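Can anybody help? Thanks :)
Edit: I tried Stephan's answer and it works for the web page www.semanticscholar.org, but I am also parsing another web page, http://www.authormapper.com/, and the same code does not work for that page when the author's name contains German umlauts. Does anyone know why this doesn't work? It is quite embarrassing not to know this...
Answer 0 (score: 3)
This is a known issue with Jsoup. Here are two options for loading the content for Jsoup:
Option 1: JDK only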
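import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// searchURL is the page address, as in the question
InputStream is = null;
try {
    // Connect to website
    URL tmp = new URL(searchURL);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200
    int n;
    char[] buffer = new char[4096];
    // Decode the bytes explicitly as UTF-8 instead of letting Jsoup guess
    Reader r = new InputStreamReader(is, "UTF-8");
    // StringWriter is part of the JDK (StringBuilderWriter belongs to Commons IO)
    Writer w = new StringWriter();
    while (-1 != (n = r.read(buffer))) {
        w.write(buffer, 0, n);
    }

    // Parse html from the already decoded string
    String html = w.toString();
    Document doc = Jsoup.parse(html, searchURL);
} catch (IOException e) {
    // Handle exception ...
} finally {
    try {
        if (is != null) {
            is.close();
        }
    } catch (final IOException ioe) {
        // ignore
    }
}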
Option 2: Using Commons IO
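import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// searchURL is the page address, as in the question
InputStream is = null;
try {
    // Connect to website
    URL tmp = new URL(searchURL);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup, decoding the bytes explicitly as UTF-8
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200
    String html = IOUtils.toString(is, "UTF-8");

    // Parse html from the already decoded string
    Document doc = Jsoup.parse(html, searchURL);
} catch (IOException e) {
    // Handle exception ...
} finally {
    IOUtils.closeQuietly(is);
}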
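Both options hard-code "UTF-8" when turning the response bytes into text, so they only help if the page really is UTF-8. To illustrate why the charset used for decoding matters, here is a minimal sketch (not part of the original answer) that decodes the same UTF-8 bytes of "Käse" twice, once as UTF-8 and once as ISO-8859-1. The class name `CharsetDemo` and the helper `decode` are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: decodes raw bytes using the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // \u00e4 is the a-umlaut; the escape keeps this source file ASCII-only.
        byte[] utf8Bytes = "K\u00e4se".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        // The umlaut takes two bytes in UTF-8. Decoded as ISO-8859-1,
        // each of those bytes becomes its own character.
        System.out.println(right.length()); // 4
        System.out.println(wrong.length()); // 5
    }
}
```

With the wrong charset the single ä turns into the two characters Ã¤, which is the typical garbling seen when UTF-8 content is read as a single-byte encoding.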
Final thoughts:
- Never rely on the encoding a website declares unless you have manually checked (when possible) which encoding is really in use.
- Never rely on Jsoup to somehow find the right encoding.
- You can [automate encoding guessing][2]. See the previous link for details.
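As one possible starting point for automating the guess, the charset declared in the HTTP `Content-Type` response header can be read and used for decoding, with a fallback when nothing usable is declared. This is only a sketch under that assumption and not necessarily what the linked article does; the class `CharsetGuess` and the method `fromContentType` are hypothetical names. Since a server can declare the wrong charset, the first point above still applies.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetGuess {
    // Hypothetical helper: extracts the charset from a Content-Type value
    // such as "text/html; charset=ISO-8859-1". Returns the fallback when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Strip the parameter name and optional quotes around the value
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

In the options above, the result of `fromContentType(connection.getContentType(), StandardCharsets.UTF_8)` could be passed to `new InputStreamReader(is, charset)` in place of the hard-coded "UTF-8".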