Question

I have two XML files (A and B), each with about 90k records.

The file format is as follows:

<trips>
    <trip id="" speed=""/>
              .
              .
              .
              .
</trips>

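I need to compare the speed attributes of the trips that share the same id attribute in the two files. However, the ids do not appear in the same position in both files. For example, the following does not work: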

from xml.dom import minidom

A = minidom.parse('A.xml')
B = minidom.parse('B.xml')

triplistA = A.getElementsByTagName('trip')
triplistB = B.getElementsByTagName('trip')

for i in range(len(triplistA)):  # A and B have the same number of trip tags
    tripA = triplistA[i]
    tripB = triplistB[i]

    # get the speed from tripA and tripB and compare, then do something

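This means I have to search file B for the same id before I can compare the speeds. In the worst case this takes on the order of n^2 comparisons, which is very slow for 90k records.

My idea was that after matching a pair of trips, I would delete the record from file B, so that searching B takes less time on the next iteration. I tried removing nodes with minidom, but that took even longer, so I used ElementTree to remove the nodes instead.

Then I have: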

from xml.dom import minidom
import xml.etree.ElementTree as ET

A = minidom.parse('A.xml')
triplist = A.getElementsByTagName('trip')
B = ET.parse('B.xml')
rootB = B.getroot()

for tripA in triplist:
    for tripB in rootB.findall('trip'):
        if tripB.get('id') == tripA.attributes['id'].value:
            # take speed from both nodes and do something
            rootB.remove(tripB)  # shrink B so later searches are shorter
            break

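Because the number of nodes in file B shrinks, the process gets faster as it goes, but the whole run still takes half an hour.

My project requires me to run this comparison many times, and after comparing the speeds there is another half-hour process (some simulation; that part of the time cost is unavoidable). So I would like to know whether there is a more efficient way to search large XML files.

Thanks in advance, everyone.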
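
A rough count may help explain why removing matched nodes speeds things up only modestly. The sketch below assumes about 90,000 trips per file (the figure given above) and counts id comparisons only; it ignores the cost of `findall` and `remove` themselves, so it is an estimate, not a measurement.

```python
n = 90_000  # approximate number of <trip> records per file, taken from the question

# Nested scan without removal: in the worst case every trip in A
# is compared against every trip in B.
worst_case_comparisons = n * n

# With removal: B shrinks by one after each match, so the worst case is
# n + (n - 1) + ... + 1 comparisons.
with_removal_comparisons = n * (n + 1) // 2

print(worst_case_comparisons)    # 8100000000
print(with_removal_comparisons)  # 4050045000
```

Under these assumptions, removal roughly halves the worst-case work but leaves it quadratic in the number of records.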

Answer 1

Put the trees into dicts, then compare them:

# A and B are the documents parsed with minidom, as in the question
trips_a = {}
for trip in A.getElementsByTagName('trip'):
    # Key each trip by its id so it can be looked up directly later
    trips_a[trip.attributes['id'].value] = trip.attributes['speed'].value
for trip in B.getElementsByTagName('trip'):
    trip_value_from_B = trip.attributes['speed'].value
    trip_value_from_A = trips_a[trip.attributes['id'].value]
    # Do something with trip_value_from_A and trip_value_from_B
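
For reference, here is a minimal self-contained sketch of the same dict idea that can be run without the real files. It uses `xml.etree.ElementTree` rather than minidom, and the inline XML strings, the helper names `speeds_by_id` and `compare_speeds`, and the conversion of speed to `float` are all assumptions made for illustration, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Tiny stand-ins for A.xml and B.xml (hypothetical data); ids are in a different order.
XML_A = ('<trips><trip id="1" speed="10.0"/><trip id="2" speed="20.0"/>'
         '<trip id="3" speed="30.0"/></trips>')
XML_B = ('<trips><trip id="3" speed="33.0"/><trip id="1" speed="10.0"/>'
         '<trip id="2" speed="25.0"/></trips>')


def speeds_by_id(root):
    """Map each trip id to its speed (as a float) in one pass over the tree."""
    return {trip.get('id'): float(trip.get('speed')) for trip in root.iter('trip')}


def compare_speeds(root_a, root_b):
    """Return {id: (speed_a, speed_b)} for every id present in both trees."""
    speeds_a = speeds_by_id(root_a)
    pairs = {}
    for trip in root_b.iter('trip'):
        trip_id = trip.get('id')
        if trip_id in speeds_a:  # dict lookup instead of scanning all of A
            pairs[trip_id] = (speeds_a[trip_id], float(trip.get('speed')))
    return pairs


pairs = compare_speeds(ET.fromstring(XML_A), ET.fromstring(XML_B))
for trip_id, (speed_a, speed_b) in sorted(pairs.items()):
    print(trip_id, speed_a, speed_b, speed_b - speed_a)
```

With the real files, the roots would presumably come from `ET.parse('A.xml').getroot()` and `ET.parse('B.xml').getroot()`. Each file is then traversed once, so the work grows roughly linearly with the number of trips instead of quadratically.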

A more efficient way to search large XML files in Python
