Question

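I've run into a situation where I need to read several XML files and build a single model from them. Unfortunately, the files are generated by a legacy system that I have absolutely no ability to change.

One of the XML files giving me trouble looks more or less like this (altered to remove proprietary data):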

<resource lang="en" dataId="900">
 numbered content here, 900-919 ...

    <string name="920-name">Document Shredder</string>
    <string name="920-desc">A machine ideal for destroying documents that deserve it. It can cross-shred anything from tissue paper to small netbooks with minimal noise. Remember, hackers can't access the documents if you've shredded the drives.</string>
    <string name="920-cat">office,appliance</string>
    <string name="921-name">Plastic Ladle</string>
    <string name="921-desc">This is a big plastic ladle, ideal for soups and sauces.</string>
    <string name="921-cat">kitchen,utensils</string>

... similar numbered content here, 922-934 ...

    <string name="935-name">Green Laser Pointer</string>
    <string name="935-desc">A High-Powered green laser pointer, ideal for irritating cats.</string>
    <string name="935-cat">office,tool</string>
    <string name="936-name">Black Metal Filing Cabinet</string>
    <string name="936-desc">A large, metal cabinet (black) built to store hanging file folders.</string>
    <string name="936-cat">office,storage</string>

... similar numbered content here, 937-994
</resource>

I parse this into a List<CString>, where CString.java is:

public class CString {
    public String name;
    public String body; // text content of the <string> element

    @Override
    public String toString() {
        return "CString {!name: " + name + " !body: " + body + "}\n";
    }
}

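I tried DocumentBuilder first, and when that didn't work correctly, I switched to using SaxParser directly. Either way, when I get my CStrings back, some of the bodies actually contain unparsed tags from a different part of the document. For example, printing the List<CString> mentioned above might produce something like this: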

[ CStrings for 900-919 ...

, CString {!name: 920-name !body: Document Shredder}
, CString {!name: 920-desc !body: irritating cats.</string>
    <string name="935-cat">office,tool</string>
    <string name="936-name">Black Metal Filing Cabinet</e. Remember, hackers can't access the documents if you've shredded the drives.}
, CString {!name: 920-cat !body: office,appliance}
, CString {!name: 921-name !body: Plastic Ladle}
, CString {!name: 921-desc !body: This is a big plastic ladle, ideal for soups and sauces.}
, CString {!name: 921-cat !body: kitchen,utensils}

... CStrings for 922-934 ... 

, CString {!name: 935-name !body: Green Laser Pointer}
, CString {!name: 935-desc !body: A High-Powered green laser pointer, ideal for irritating cats.}
, CString {!name: 935-cat !body: office,tool}
, CString {!name: 936-name !body: Black Metal Filing Cabinet}
, CString {!name: 936-desc !body: A large, metal cabinet (black) built to store hanging file folders.}
, CString {!name: 936-cat !body: office,storage}

... CStrings for 937-994
]

In the SaxParser version of my code, I used the following in the DefaultHandler's characters method:

public void characters(char ch[], int start, int length) throws SAXException {
    String value = new String(ch, start, length).trim();
    switch (currentQName.toString()) { // currentQName is a StringBuilder that holds just the current XML element's name
        case "string":
            if (value.contains("</string")) {
                System.err.println("!!! Parse Error !!! " + value);
            }
            break;
    }
}

which, as you might have guessed, produces:

!!! Parse Error !!! irritating cats.</string>
        <string name="935-cat">office,tool</string>
        <string name="936-name">Black Metal Filing Cabinet</e. Remember, hackers can't access the documents if you've shredded the drives.

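I normally wouldn't ask a question this esoteric, especially when I can't provide the exact data and code, but no amount of Googling has turned up anything I could pin down, and of course the code isn't throwing (or suppressing) any exceptions.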

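One thing I have noticed is that when bad data shows up, as in the CString for 920-desc above, the bad data is 138 characters long and, not coincidentally, the good data picks up exactly at character 139 of what the text should have been. That makes me think it's some kind of buffering issue. However, whether I let DocumentBuilder manage the buffers or manage them more manually by using SaxParser directly, I still get exactly the same bad text in the same place every time. Finally, I haven't noticed any bad text in the shorter strings (name and cat), which I think also points to a char buffer problem.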
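One buffering-related cause that can be ruled out cheaply: a SAX parser is allowed to deliver the text of a single element across several characters() calls, so a handler that inspects each chunk in isolation can see partial text. The following is a minimal sketch, not the code from this project; the class name StringCollector and the "name=body" output format are made up for illustration. It accumulates text in a StringBuilder and only reads it in endElement.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects every <string> element as "name=body".
class StringCollector extends DefaultHandler {
    final List<String> entries = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String currentName; // non-null only while inside a <string> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("string".equals(qName)) {
            currentName = attrs.getValue("name");
            text.setLength(0); // start a fresh body for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append instead of assigning.
        if (currentName != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("string".equals(qName)) {
            entries.add(currentName + "=" + text.toString().trim());
            currentName = null;
        }
    }

    // Parses the given XML text and returns the collected entries in document order.
    static List<String> parse(String xml) throws Exception {
        StringCollector handler = new StringCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.entries;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<resource lang=\"en\" dataId=\"900\">"
                + "<string name=\"921-name\">Plastic Ladle</string>"
                + "<string name=\"921-cat\">kitchen,utensils</string>"
                + "</resource>";
        parse(xml).forEach(System.out::println);
    }
}
```

If the bad text still appears with a handler like this, chunked delivery is not the cause and the input being handed to the parser is the next thing to inspect.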

Any ideas would be helpful!

Answer 1

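Almost certainly, you don't have well-formed XML. (The comment about being absolutely unable to change the source system is a bad omen, but you're hardly alone in this predicament.)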
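A quick way to test that suspicion is to run the raw text through a parser and report the first fatal error with its position. This is only a sketch; the class and method names (WellFormedCheck, firstError) are invented here, and it checks well-formedness only, not validity against any schema.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for checking whether a piece of text is well-formed XML.
class WellFormedCheck {
    // Returns null when the text parses cleanly, otherwise a description
    // of the first fatal error including its line and column.
    static String firstError(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors and ignores warnings,
        // which keeps the parser from printing to stderr on its own.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstError("<resource><string name=\"920-name\">Document Shredder</string></resource>"));
        System.out.println(firstError("<resource><string name=\"920-name\">Document Shredder</resource>"));
    }
}
```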

Take a look at this question: How to parse badly formed XML in Java?

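If I were you, I'd use raw string manipulation and/or regular expressions to either extract the data directly or repair it into well-formed XML. Incidentally, JAXB is a much nicer way to work with XML in Java (but it still requires the XML to be well-formed).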
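As a rough illustration of the direct-extraction route, here is a sketch that pulls name/body pairs out of the raw text with a regular expression. It assumes the elements look like the sample in the question (a single name attribute in double quotes, no nested tags); the class name RegexStringExtractor is made up, and entity references such as &amp; are left undecoded.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor for <string name="...">...</string> elements.
class RegexStringExtractor {
    // DOTALL lets a body span line breaks; the reluctant (.*?) stops at
    // the first closing </string> tag rather than the last one.
    private static final Pattern STRING_ELEMENT =
            Pattern.compile("<string\\s+name=\"([^\"]*)\"\\s*>(.*?)</string>", Pattern.DOTALL);

    // Returns name -> body in document order.
    static Map<String, String> extract(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = STRING_ELEMENT.matcher(text);
        while (m.find()) {
            result.put(m.group(1), m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "<resource lang=\"en\" dataId=\"900\">\n"
                + "    <string name=\"921-name\">Plastic Ladle</string>\n"
                + "    <string name=\"921-cat\">kitchen,utensils</string>\n"
                + "</resource>";
        extract(sample).forEach((name, body) -> System.out.println(name + " -> " + body));
    }
}
```

This sidesteps the parser entirely, at the cost of having to handle escaping and any irregular markup by hand.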

Answer 2

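I found a place in the code where special characters were being cleaned up unnecessarily (I assume to work around earlier problems with badly formed output from the source).

Here is the method that did all of the stripping: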

private static InputSource getCleanSource(File file) {
    InputSource source = null;
    try {
        InputStream stream = new FileInputStream(file);
        String fileText = readFile(stream); // Gets file content as text from InputStream

        CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
        utf8Decoder.onMalformedInput(CodingErrorAction.IGNORE);
        utf8Decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
        CharBuffer parsed = utf8Decoder.decode(ByteBuffer.wrap(readFile(stream).getBytes()));

        fileText = "<?xml version=\"1.1\" encoding=\"UTF-8\" ?>\n" + // put a good header
                parsed
                .replaceAll("<\\?.*?\\?>", "") // remove bad <?xml> tags
                .replaceAll("--+","--") // can't have <!--- text --->
                .replaceFirst("(?s)^.+?<\\?", "<?") // remove bad stuff before <?xml> tag
                .replaceAll("[^\\x20-\\x7e\\x0A]", "") // remove bad characters
                .replaceAll("[\\x0A]", " ") // remove line breaks
                ;
        Reader reader = new StringReader(fileText);
        source = new InputSource(reader);
    } catch (Throwable t) {
        System.err.println("Unknown trouble parsing: " + file.getName());
        t.printStackTrace();
    }

    return source;
}

After reviewing and adjusting this, everything works if I change the method to:

private static InputSource getCleanSource(File file) {
    InputSource source = null;
    try {
        InputStream stream = new FileInputStream(file);
        String fileText = readFile(stream) // Gets file content as text from InputStream
                .replaceAll("--+","--") // can't have <!--- text --->
                .replaceFirst("(?s)^.+?<\\?", "<?") // remove bad stuff before <?xml> tag
                ;
        Reader reader = new StringReader(fileText);
        source = new InputSource(reader);
    } catch (Throwable t) {
        System.err.println("Unknown trouble parsing: " + file.getName());
        t.printStackTrace();
    }

    return source;
}

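I haven't yet had time to go back and figure out which mystery characters or tags were being eaten by the cleaning process. I have to assume the XML that the source system originally produced was far less valid than what it produces now, which would have justified such aggressive cleaning, but I don't think I'll ever know for sure.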
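For anyone who does want to track that down, one approach is to replay the regex steps from the original method one at a time and record which of them actually change the text. The sketch below is only that, a sketch: the class name CleaningAudit and the report format are invented, the step labels are paraphrased from the original comments, and it covers only the regex steps, not the charset decoding or the prepended XML header.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical audit of the regex steps used by the original getCleanSource.
class CleaningAudit {
    // {label, regex, replacement, "all" or "first"}, in the order they originally ran.
    private static final String[][] STEPS = {
        {"remove <?...?> tags",        "<\\?.*?\\?>",         "",   "all"},
        {"collapse runs of dashes",    "--+",                 "--", "all"},
        {"strip text before first <?", "(?s)^.+?<\\?",        "<?", "first"},
        {"remove non-printable chars", "[^\\x20-\\x7e\\x0A]", "",   "all"},
        {"replace line breaks",        "[\\x0A]",             " ",  "all"},
    };

    // Applies each step in turn and reports the ones that modified the text.
    static List<String> audit(String text) {
        List<String> report = new ArrayList<>();
        String current = text;
        for (String[] step : STEPS) {
            String next = "first".equals(step[3])
                    ? current.replaceFirst(step[1], step[2])
                    : current.replaceAll(step[1], step[2]);
            if (!next.equals(current)) {
                report.add(step[0] + ": " + current.length() + " -> " + next.length() + " chars");
            }
            current = next;
        }
        return report;
    }

    public static void main(String[] args) {
        String sample = "<!--- legacy header --->\n<string name=\"920-name\">Document Shredder</string>";
        audit(sample).forEach(System.out::println);
    }
}
```

Running this against the real file contents, and diffing the text before and after any step that reports a change, should narrow down which replacement is responsible.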

Java - SaxParser / DocumentBuilder "unable" to get the correct tag body
