Question

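I have one XML file that contains IDs and a second XML file that contains the same IDs. I want to cross-reference the two files and extract information from the second one. The first file contains only the IDs I need. For example, the first file contains IDs 345, 350, 353, 356, while the second file contains IDs 345, 346, 347, 348, 349, 350, .... I want to extract the data nodes, together with all of their child nodes, from the second file.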

Structure of the first file:

<data>
    <node>
        <info>info</info>
        <id>345</id>
    </node>
    <node2>
        <node3>
                <info2>info</info2>
                <id>2</id>
        </node3>
        <otherinfo>1</otherinfo>
        <text type = "02">
                <role>info</role>
                <st>1</st>
        </text>
    </node2>
</data>

Structure of the second file:

<data>
    <node>
        <info>info</info>
        <id>345</id>
    </node>
    <node2>And a bunch of other nodes</node2>
    <node2>And a bunch of other nodes</node2>
    <node2>And a bunch of other nodes</node2>
</data>

I tried a Ruby/Nokogiri solution, but I can't seem to get very far. I'm open to a solution in any scripting language.

Answer 1

Extract all id values from the first XML string:

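from lxml import etree

# xml1 holds the contents of the first file
e1 = etree.fromstring(xml1)
# Text of every <id> element, at any depth in the document
ids = e1.xpath('//id/text()')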
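
Note that with the sample first file, `//id` also matches the nested `<id>2</id>` under `<node3>`. That is harmless for the lookup below, which only returns `<node>` parents, but if you want only the IDs that sit directly under `<node>`, a narrower path does it. The following is a minimal sketch using the standard library's `xml.etree.ElementTree` rather than lxml (a swapped-in library, so the sketch runs without extra installs); the shortened sample document and the names `all_ids` and `wanted_ids` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened version of the first file from the question (illustrative).
xml1 = """<data>
    <node>
        <info>info</info>
        <id>345</id>
    </node>
    <node2>
        <node3>
            <info2>info</info2>
            <id>2</id>
        </node3>
    </node2>
</data>"""

root1 = ET.fromstring(xml1)

# Every <id> in the document, in document order (like //id in XPath).
all_ids = [el.text for el in root1.iter("id")]

# Only <id> elements that are direct children of a top-level <node>.
wanted_ids = [el.text for el in root1.findall("./node/id")]

print(all_ids)     # ['345', '2']
print(wanted_ids)  # ['345']
```

With lxml, the equivalent restriction would be the XPath `/data/node/id/text()`.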

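Extract all <node> elements from the second XML string that are the parent of an <id> element whose value is one of the known ids from the first file: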

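import re

e2 = etree.fromstring(xml2)
# EXSLT regular-expression extension, which lxml supports in XPath
ns_re = dict(re="http://exslt.org/regular-expressions")
# Alternation of all known ids, e.g. "345|350|353|356"
re_id = "|".join(map(re.escape, ids))
# <node> elements whose <id> child matches one of the ids exactly
nodes = e2.xpath("//id[re:test(.,'^(?:%s)$')]/parent::node" % re_id,
                 namespaces=ns_re)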
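
If the EXSLT regular-expression approach feels heavy, or the ID list is long, the same cross-reference can be done with a plain set lookup. This is a sketch of an alternative, not the method above: it uses the standard library's `xml.etree.ElementTree` instead of lxml, and the sample documents and the helper name `cross_reference` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tiny stand-ins for the two files (illustrative data only).
xml1 = "<data><node><id>345</id></node><node><id>350</id></node></data>"
xml2 = """<data>
    <node><info>a</info><id>345</id></node>
    <node><info>b</info><id>346</id></node>
    <node><info>c</info><id>350</id></node>
</data>"""


def cross_reference(first_xml, second_xml):
    """Return <node> elements of second_xml whose <id> appears in first_xml."""
    # Collect the wanted ids into a set for fast membership tests.
    wanted = {
        el.text.strip()
        for el in ET.fromstring(first_xml).findall("./node/id")
        if el.text
    }
    matches = []
    for node in ET.fromstring(second_xml).iter("node"):
        id_el = node.find("id")  # direct <id> child, if any
        if id_el is not None and id_el.text and id_el.text.strip() in wanted:
            matches.append(node)
    return matches


matched = cross_reference(xml1, xml2)
matched_ids = [n.findtext("id") for n in matched]
print(matched_ids)  # ['345', '350']

# Each match is a full element, so its children come along when serialized.
for n in matched:
    print(ET.tostring(n, encoding="unicode").strip())
```

Each returned element still carries all of its child nodes, so it can be serialized or appended to a new document as a whole.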
