Question

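For some reason I cannot use the SAX or DOM parsers, so I need to parse the file with regular expressions.

I want to extract key-value pairs, where the key is the content of tag1 and the value is the content of tag3. Some keys have no corresponding value, and I must ignore those keys.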

XML file:

<Main Tag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element><element><tag1>Key2</tag1><tag2>Not intrested</tag2></element><element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></Main Tag>

The same XML file, indented:

<Main Tag>
    <element>
        <tag1>Key1</tag1>
        <tag2>Not intrested</tag2>
        <tag3>Value1</tag3>
    </element>
    <element>
        <tag1>Key2</tag1>
        <tag2>Not intrested</tag2>
    </element>
    <element>
        <tag1>Key3</tag1>
        <tag2>Not intrested</tag2>
        <tag3>Value3</tag3>
    </element>
</Main Tag>

So from the above file I need to extract Key1-Value1 and Key3-Value3 with a Matcher, ignoring Key2 because it has no value.

Answer 1

The tool you want for this is XPath; it is designed for exactly what you are doing.
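As a rough illustration of the XPath suggestion, the sketch below selects only the element nodes that have a tag3 child and reads tag1 and tag3 from each. The class name XPathKeyValueSketch and the method name extract are hypothetical, and the root is written as MainTag because a name containing a space is not valid XML. Note that the standard javax.xml.xpath API parses the input into a DOM tree internally, so this only helps if a DOM parser can actually be used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; a sketch, not the answer author's code.
class XPathKeyValueSketch {

    // Returns tag1 -> tag3 pairs, skipping <element> nodes that have no <tag3> child.
    static Map<String, String> extract(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The predicate [tag3] keeps only elements that contain a tag3 child.
            NodeList elements = (NodeList) xpath.evaluate(
                    "//element[tag3]",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            Map<String, String> pairs = new LinkedHashMap<>();
            for (int i = 0; i < elements.getLength(); i++) {
                Node element = elements.item(i);
                // Relative expressions are evaluated against the current element.
                pairs.put(xpath.evaluate("tag1", element), xpath.evaluate("tag3", element));
            }
            return pairs;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on the input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<MainTag>"
                + "<element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element>"
                + "<element><tag1>Key2</tag1><tag2>Not intrested</tag2></element>"
                + "<element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element>"
                + "</MainTag>";
        System.out.println(extract(xml)); // prints {Key1=Value1, Key3=Value3}
    }
}
```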

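If you cannot parse an XML document with standard tools, there is a reason for it, and fixing that reason is usually easier than resorting to regular expressions.

If you enable more verbose parsing, do you see errors? If so, what kind? (In this case, using a command-line XML parser rather than a Java library may help you get better output.)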

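The three most common problems I see when parsing XML:

Misconfigured namespaces: you will get errors during validation or extraction.
A subtly malformed XML document (for example, one containing illegal characters such as 0x02). These are sometimes non-printable, so you may not even see them.
A document too large to parse in memory: the parser runs out of memory during parsing (generally a DOM problem, not a SAX one).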

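Parsers differ in how strict they are about these things, so you may want to try a few tools or enable a less strict mode.

JTidy or TagSoup may be able to repair some malformed XML if the document was originally HTML.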
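For the second problem in the list above (illegal characters such as 0x02), one possible workaround is to strip characters that XML 1.0 does not allow before handing the text to a parser. The sketch below is an assumption about how that could be done, not something the answer prescribes; the class name XmlCharFilterSketch and the method name stripIllegalXmlChars are hypothetical. The character class is the negation of the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF).

```java
import java.util.regex.Pattern;

// Hypothetical helper; a sketch under the assumptions stated above.
class XmlCharFilterSketch {

    // Matches any code point that is NOT a legal XML 1.0 character.
    private static final Pattern ILLEGAL_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Removes illegal characters; tabs, line feeds and carriage returns are kept.
    static String stripIllegalXmlChars(String input) {
        return ILLEGAL_XML_CHARS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Build a string containing the non-printable control character 0x02.
        String dirty = "<tag1>Key" + (char) 0x02 + "1</tag1>";
        System.out.println(stripIllegalXmlChars(dirty)); // prints <tag1>Key1</tag1>
    }
}
```

Silently dropping characters changes the data, so logging what was removed may be preferable when the source of the document can be fixed instead.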

Answer 2

Try this pattern:

"<(tag[13])>(.+?)</tag[13]>"

Usage:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is illustrative; the snippet needs a class wrapper to compile.
public class RegexTagDemo {
    public static void main(String[] args) {
        String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element><element><tag1>Key2</tag1><tag2>Not intrested</tag2></element><element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></MainTag>";

        // Group 1 is the tag name (tag1 or tag3); group 2 is the text between the tags
        Matcher matcher = Pattern.compile("<(tag[13])>(.+?)</tag[13]>").matcher(xmlString);
        while (matcher.find()) {
            System.out.println(matcher.group(1) + " " + matcher.group(2));
        }
    }
}

Result:

tag1 Key1
tag3 Value1
tag1 Key2
tag1 Key3
tag3 Value3
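The output above still lists Key2, whereas the question asks to skip keys that have no value. One way to get pairs directly is to match tag1 and tag3 within a single element, as in the sketch below. It assumes that tag1 always comes before tag3, that element nodes are not nested, and that the text contains no CDATA sections or entities; the class name RegexKeyValueSketch and the method name extract are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch under the assumptions stated above.
class RegexKeyValueSketch {

    // Group 1 is the tag1 text, group 2 is the tag3 text of the same <element>.
    // The tempered token (?:(?!</element>).)*? stops the match from running past
    // the closing </element>, so an element without tag3 produces no match.
    private static final Pattern PAIR = Pattern.compile(
            "<element>\\s*<tag1>([^<]*)</tag1>(?:(?!</element>).)*?<tag3>([^<]*)</tag3>",
            Pattern.DOTALL);

    // Returns tag1 -> tag3 pairs in document order.
    static Map<String, String> extract(String xml) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher matcher = PAIR.matcher(xml);
        while (matcher.find()) {
            pairs.put(matcher.group(1), matcher.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String xmlString = "<MainTag>"
                + "<element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element>"
                + "<element><tag1>Key2</tag1><tag2>Not intrested</tag2></element>"
                + "<element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element>"
                + "</MainTag>";
        System.out.println(extract(xmlString)); // prints {Key1=Value1, Key3=Value3}
    }
}
```

Pattern.DOTALL lets the same pattern work on the indented version of the file, where the tags are separated by line breaks.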

NON REGEX

Alternatively, you can use Document (from the org.w3c.dom package) together with DocumentBuilderFactory (from javax.xml.parsers).

Something like this:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// The class name is illustrative; the snippet needs a class wrapper to compile.
public class DomTagDemo {
    public static void main(String[] args) throws Exception {
        String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not intrested</tag2><tag3>Value1</tag3></element><element><tag1>Key2</tag1><tag2>Not intrested</tag2></element><element><tag1>Key3</tag1><tag2>Not intrested</tag2><tag3>Value3</tag3></element></MainTag>";
        Document xmlDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new InputSource(new ByteArrayInputStream(xmlString.getBytes(StandardCharsets.UTF_8))));

        // The document element is the root node (MainTag)
        Node rootNode = xmlDocument.getDocumentElement();
        if (rootNode.hasChildNodes()) {
            // Get each element child node
            NodeList elementsList = rootNode.getChildNodes();
            for (int i = 0; i < elementsList.getLength(); i++) {
                if (elementsList.item(i).hasChildNodes()) {
                    // Get each tag child node of the element node
                    NodeList tagsList = elementsList.item(i).getChildNodes();
                    for (int i2 = 0; i2 < tagsList.getLength(); i2++) {
                        Node tagNode = tagsList.item(i2);
                        if (tagNode.getNodeName().matches("tag1|tag3")) {
                            System.out.println(tagNode.getNodeName() + " " + tagNode.getTextContent());
                        }
                    }
                }
            }
        }
    }
}

Result:

tag1 Key1
tag3 Value1
tag1 Key2
tag1 Key3
tag3 Value3

Parsing an XML file using Java regex
