Question

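Is there a way to track down encoding problems in an XML file? I am trying to parse such a file (let's call it doc) with the XML library in R, but there seems to be a problem with the encoding.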

xmlInternalTreeParse(doc, asText=TRUE)
Error: Document labelled UTF-16 but has UTF-8 content.
Error: Input is not proper UTF-8, indicate encoding!
Error: Premature end of data in tag ...

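followed by a list of tags in which the data supposedly ends prematurely. However, I am fairly sure that nothing ends prematurely in this document.

OK, so next I tried: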

doc <- iconv(doc, to="UTF-8")
doc <- sub("utf-16", "utf-8", doc)
xmlInternalTreeParse(doc, asText=TRUE)
Error: Premature end of data in tag...

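Again followed by a list of tags, along with line numbers. I have checked those lines, but I cannot find any errors.

Another suspicion: the "µ" character that occurs in the document might be causing the error. So next I tried: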

doc <- iconv(doc, to="UTF-8")
doc <- gsub("µ", "micro", doc)
doc <- sub("utf-16", "utf-8", doc)
xmlInternalTreeParse(doc, asText=TRUE)
Error: Premature end of data in tag...

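Any other debugging suggestions?

EDIT: After spending two days trying to fix the error, I still have not found a solution. However, I think I have narrowed down the possible causes. Here is what I found: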

Copy the XML string from the source database into a file and save it as a separate xml file in Notepad++ -> Document labelled UTF-16 but has UTF-8 content.
Change <?xml version="1.0" encoding="utf-16"?> to <?xml version="1.0" encoding="utf-8"?> (or encoding="latin1") in this file -> no error.
Read the XML string from the database via doc <- sqlQuery(myconn, query.text, stringsAsFactors = FALSE); doc <- doc[1,1], manipulate it with str_sub(doc, 35, 36) <- "8" or str_sub(doc, 31, 36) <- "latin1", and then try to parse it with xmlInternalTreeParse(doc) -> Premature end of data in tag...
Read the XML string from the database as above and then try to parse it with xmlInternalTreeParse(doc) -> Document labelled UTF-16 but has UTF-8 content. Input is not proper UTF-8, indicate encoding ! Bytes: 0xE4 0x64 0x2E 0x20 Premature end of data in tag... (followed by a list of tags).
Read the XML string from the database as above and parse it with xmlInternalTreeParse(doc, encoding="latin1") -> Premature end of data in tag...
Using doc <- iconv(doc[1,1], to="UTF-8") or to="latin1" before parsing does not change anything.

Any suggestions would be greatly appreciated.

Answer 1

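The encoding problem arose because the encoding of the original XML file did not match the encoding of the SQL database in which the XML content is stored as longtext. Replacing the encoding declaration in the XML string and converting the string solved the problem: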

doc <- sqlQuery(myconn, query.text, stringsAsFactors = FALSE)
doc <- iconv(doc[1,1], to="UTF-8")
doc <- sub("utf-16", "utf-8", doc)
doc <- xmlInternalTreeParse(doc, asText = TRUE)
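
For anyone debugging a similar mismatch, here is a minimal base-R sketch of how the offending bytes might be located before converting. It is not the code used above: the doc string is a hypothetical stand-in built from the bytes quoted in the error message (0xE4 0x64 0x2E 0x20), and latin1 as the source encoding is an assumption that would need to be checked against the real database.

```r
# Hypothetical stand-in for the string returned by sqlQuery(): the declaration
# says utf-16, but the body holds the single-byte sequence from the error
# message (0xE4 is "ä" in latin1).
doc <- rawToChar(c(
  charToRaw('<?xml version="1.0" encoding="utf-16"?><a>'),
  as.raw(c(0xE4, 0x64, 0x2E, 0x20)),
  charToRaw('</a>')
))

# validUTF8() is in base R (>= 3.3.0); FALSE means the bytes are not valid UTF-8.
is_valid <- validUTF8(doc)

# Positions of bytes >= 0x80, the usual suspects for encoding trouble.
bytes <- charToRaw(doc)
high_byte_pos <- which(as.integer(bytes) >= 0x80)

# Convert with an explicit source encoding (latin1 is an assumption here),
# then make the declaration match the converted content.
fixed <- iconv(doc, from = "latin1", to = "UTF-8")
fixed <- sub('encoding="utf-16"', 'encoding="utf-8"', fixed, ignore.case = TRUE)
```

Passing from= explicitly avoids relying on the session's locale, which is what iconv() falls back to when from is omitted.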

Truncation of the XML string during retrieval from the database turned out to be a separate problem. A solution is given here: How to retrieve a very long XML-string from an SQL database with R?
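
A quick way to check whether a retrieved string has been cut off is to test whether it still ends with the closing tag of its root element. The sketch below uses only base R; looks_truncated, the root tag name "report", and the sample strings are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: TRUE if the string does not end with the closing tag
# of its root element (the caller must supply the root tag name).
looks_truncated <- function(xml_text, root_tag) {
  closing <- paste0("</", root_tag, ">")
  trimmed <- sub("\\s+$", "", xml_text)  # ignore trailing whitespace
  !endsWith(trimmed, closing)
}

# Made-up sample data: a complete document and a copy cut off mid-way.
complete  <- "<report><item>1</item></report>\n"
cut_short <- substr(complete, 1, 20)

res_complete  <- looks_truncated(complete, "report")
res_cut_short <- looks_truncated(cut_short, "report")

# The byte length of the retrieved string can hint at a driver-side limit.
n_bytes <- nchar(cut_short, type = "bytes")
```

If the check reports truncation, the "Premature end of data in tag..." error is probably coming from the retrieval step and not from the encoding.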

Debugging encoding problems (R XML)
