Question

I have a few questions about jsoup's charset support, most of them backed by quotes from the API documentation:

jsoup.Jsoup:


public static Document parse(File in, String charsetName) ...
  Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 ...

Does this mean the 'charset' meta tag is not used to detect the encoding?
jsoup.nodes.Document:


public void charset(Charset charset)
  ... This method is equivalent to OutputSettings.charset(Charset) but in addition ...


public Charset charset()
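  ... This method is equivalent to Document.OutputSettings.charset().

Does this mean there is no separate "input charset" and "output charset", and that they are really the same setting?
jsoup.nodes.Document: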

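public void charset(Charset charset) ... Obsolete charset / encoding definitions are removed!

Does this remove the 'http-equiv' meta tag in favor of the 'charset' meta tag? For backward compatibility, is there a way to keep both?
jsoup.nodes.Document.OutputSettings: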

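public Charset charset() Where possible (when parsing from a URL or File), the document's output charset is automatically set to the input charset. Otherwise, it defaults to UTF-8.

I need to know whether the document did not specify an encoding*. Does this mean jsoup cannot provide this information?

*Instead of defaulting to UTF-8, I would run juniversalchardet.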

Answer 1

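The documentation is out of date / incomplete. Jsoup does use the charset meta tag, as well as the http-equiv tag, to detect the charset. Looking at the source, this method looks like this: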

public static Document parse(File in, String charsetName) throws IOException {
    return DataUtil.load(in, charsetName, in.getAbsolutePath());
}

DataUtil.load in turn calls parseByteData(...), which looks like this: (Source, scroll down)

//reads bytes first into a buffer, then decodes with the appropriate charset. done this way to support
// switching the chartset midstream when a meta http-equiv tag defines the charset.
// todo - this is getting gnarly. needs a rewrite.
static Document parseByteData(ByteBuffer byteData, String charsetName, String baseUri, Parser parser) {
    String docData;
    Document doc = null;

    if (charsetName == null) { // determine from meta. safe parse as UTF-8
        // look for <meta http-equiv="Content-Type" content="text/html;charset=gb2312"> or HTML5 <meta charset="gb2312">
        docData = Charset.forName(defaultCharset).decode(byteData).toString();
        doc = parser.parseInput(docData, baseUri);
        Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();
        if (meta != null) { // if not found, will keep utf-8 as best attempt
            String foundCharset = null;
            if (meta.hasAttr("http-equiv")) {
                foundCharset = getCharsetFromContentType(meta.attr("content"));
            }
            if (foundCharset == null && meta.hasAttr("charset")) {
                try {
                    if (Charset.isSupported(meta.attr("charset"))) {
                        foundCharset = meta.attr("charset");
                    }
                } catch (IllegalCharsetNameException e) {
                    foundCharset = null;
                }
            }

            (Snip...)

The following line from the snippet above shows that it does use either meta[http-equiv=content-type] or meta[charset] to detect the encoding, and otherwise falls back to UTF-8.

Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();

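I'm not quite sure what you mean, but no: the output charset setting controls which characters are escaped when the document's HTML / XML is printed to a string, whereas the input charset determines how the file is read.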
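To make the distinction concrete, here is a minimal sketch using only the JDK, not jsoup itself. The class and method names (CharsetRoles, decode, escapeFor) are made up for illustration, and writing unencodable characters as hexadecimal numeric entities is a simplification; jsoup's own escaping may choose named entities instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CharsetRoles {
    // Input side: the charset decides how raw bytes are turned into characters.
    public static String decode(byte[] bytes, Charset inputCharset) {
        return new String(bytes, inputCharset);
    }

    // Output side: characters the target charset cannot encode are written
    // as numeric entities (a simplification of what an HTML serializer does).
    public static String escapeFor(String text, Charset outputCharset) {
        CharsetEncoder encoder = outputCharset.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "café" stored as ISO-8859-1 bytes, read with the matching input charset.
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        String text = decode(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(escapeFor(text, StandardCharsets.UTF_8));    // é is kept as-is
        System.out.println(escapeFor(text, StandardCharsets.US_ASCII)); // é becomes &#xe9;
    }
}
```

The same text is decoded once, and only the serialization differs depending on the output charset.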

It only removes meta[name=charset] elements. From the source code, this is the method that updates / removes the charset definition in the document: (Source, again scroll down)

private void ensureMetaCharsetElement() {
    if (updateMetaCharset) {
        OutputSettings.Syntax syntax = outputSettings().syntax();

        if (syntax == OutputSettings.Syntax.html) {
            Element metaCharset = select("meta[charset]").first();

            if (metaCharset != null) {
                metaCharset.attr("charset", charset().displayName());
            } else {
                Element head = head();

                if (head != null) {
                    head.appendElement("meta").attr("charset", charset().displayName());
                }
            }

            // Remove obsolete elements
            select("meta[name=charset]").remove();
        } else if (syntax == OutputSettings.Syntax.xml) {
            (Snip..)

Basically, if you call charset(...) and the document has no charset meta tag, it adds one; otherwise it updates the existing tag. It does not touch the http-equiv tag.

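If you want to find out whether the document specifies an encoding, just look for an http-equiv charset or meta charset tag; if there is no such tag, the document does not specify an encoding.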
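A rough sketch of that check, done on the raw bytes before any parsing and using only the JDK. Everything here is an assumption for illustration: the class DeclaredCharset, its methods, and the regular expression are not part of jsoup, and the regex is far cruder than jsoup's selector-based lookup. Unlike parseByteData, which decodes provisionally as UTF-8, this sketch decodes provisionally as ISO-8859-1, which only works for ASCII-compatible encodings. The fallbackDetector parameter stands in for whatever detector you plug in (for example juniversalchardet); its real API is not shown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeclaredCharset {
    // Matches charset=... inside a <meta> tag. Covers both <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?\\s*([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset the document declares, or empty if it declares
    // none (or one this JVM does not support).
    public static Optional<Charset> declared(byte[] bytes) {
        // Meta tags are plain ASCII, so a lossless ISO-8859-1 decode is enough to find them.
        String provisional = new String(bytes, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(provisional);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            String name = m.group(1);
            return Charset.isSupported(name) ? Optional.of(Charset.forName(name)) : Optional.empty();
        } catch (IllegalArgumentException e) { // includes IllegalCharsetNameException
            return Optional.empty();
        }
    }

    // Decodes with the declared charset, or asks the supplied detector when nothing is declared.
    public static String decode(byte[] bytes, Function<byte[], Charset> fallbackDetector) {
        Charset cs = declared(bytes).orElseGet(() -> fallbackDetector.apply(bytes));
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        byte[] withMeta = "<html><head><meta charset=\"ISO-8859-1\"></head><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] withoutMeta = "<html><body>caf\u00e9</body></html>".getBytes(StandardCharsets.UTF_8);

        System.out.println(declared(withMeta).isPresent());    // a charset is declared
        System.out.println(declared(withoutMeta).isPresent()); // nothing declared: run the detector

        // Stand-in detector that always answers UTF-8; replace with a real one.
        System.out.println(decode(withoutMeta, b -> StandardCharsets.UTF_8));
    }
}
```

Once the charset is known, it can be passed explicitly as charsetName to Jsoup.parse(File, String), so jsoup does not have to guess.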

Jsoup is open source, so you can look at the source code yourself to see how it works: https://github.com/jhy/jsoup/ (you can also modify it to do what you want!)

I'll update this answer with more detail when I have time. Let me know if you have any other questions.

jsoup and character encoding