Making this method faster

Time: 2019-06-10 07:00:23

Tags: python performance elementtree

I have written this function in Python 3 to merge two XML files.

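The merge is done at the first level only, so the function does not need to call itself recursively. The problem is that the XML files are large, so it takes a long time. Please help me optimize this code. Thanks.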

Here is the function:

def combine_element(one, other):
    channel_ids = []
    programs_startstop = []

    for el in one:
        if el.tag == 'channel':
            channel_ids.append(el.get('id'))
        elif el.tag == 'programme':
            programs_startstop.append((el.get('start'), el.get('stop')))

    i = 0
    printProgressBar(i, len(other), prefix = 'Progress:', suffix = 'Complete', length = 50)
    for el in other:
        if el.tag == 'channel':
            if not el.get('id') in channel_ids:
                one.append(el)
                channel_ids.append(el.get('id'))
        elif el.tag == 'programme':
            if not (el.get('start'), el.get('stop')) in programs_startstop:
                one.append(el)
                programs_startstop.append((el.get('start'), el.get('stop')))
        i += 1
        printProgressBar(i, len(other), prefix = 'Progress:', suffix = 'Complete', length = 50)

Here is an example of the XML files to merge:

First file:

<tv>
 <channel id="C1">
  <display-name lang="en">C1</display-name>
 </channel>
 <channel id="C2">
  <display-name lang="en">C2</display-name>
 </channel>
 <programme channel="C1" start="20190607040000 +0000" stop="20190607043000 +0000">
  <title lang="en">P1</title>
  <desc lang="en">Program 1</desc>
 </programme>
 <programme channel="C2" start="20190707040000 +0000" stop="20190707043000 +0000">
  <title lang="en">P2</title>
  <desc lang="en">Program 2</desc>
 </programme>
</tv>

Second file:

<tv>
 <channel id="C3">
  <display-name lang="en">C3</display-name>
 </channel>
 <channel id="C4">
  <display-name lang="en">C4</display-name>
 </channel>
 <programme channel="C3" start="20190607070000 +0000" stop="20190607073000 +0000">
  <title lang="en">P3</title>
  <desc lang="en">Program 3</desc>
 </programme>
 <programme channel="C4" start="20190707050000 +0000" stop="20190707063000 +0000">
  <title lang="en">P4</title>
  <desc lang="en">Program 2</desc>
 </programme>
</tv>

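The code should ignore a channel in the second file if a channel with the same id already exists in the first file, and it should ignore a programme in the second file if a programme in the first file has the same start and stop times. The XML given here is only an example, because I cannot share the actual data.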

This is the expected result of the method; I just need it to be produced faster:

<tv>
 <channel id="C1">
  <display-name lang="en">C1</display-name>
 </channel>
 <channel id="C2">
  <display-name lang="en">C2</display-name>
 </channel>
 <programme channel="C1" start="20190607040000 +0000" stop="20190607043000 +0000">
  <title lang="en">P1</title>
  <desc lang="en">Program 1</desc>
 </programme>
 <programme channel="C2" start="20190707040000 +0000" stop="20190707043000 +0000">
  <title lang="en">P2</title>
  <desc lang="en">Program 2</desc>
 </programme>
 <channel id="C3">
  <display-name lang="en">C3</display-name>
 </channel>
 <channel id="C4">
  <display-name lang="en">C4</display-name>
 </channel>
 <programme channel="C3" start="20190607070000 +0000" stop="20190607073000 +0000">
  <title lang="en">P3</title>
  <desc lang="en">Program 3</desc>
 </programme>
 <programme channel="C4" start="20190707050000 +0000" stop="20190707063000 +0000">
  <title lang="en">P4</title>
  <desc lang="en">Program 2</desc>
 </programme>
</tv>

1 answer:

Answer 0 (score: 1)

You should extract the part that retrieves the elements into a generator function that yields (key, value) tuples.

Create a dictionary from the result of calling the generator function on each of the two arguments, then merge the dictionaries.

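def elements(lst):
    # Yield (key, element) pairs: a channel is keyed by its id,
    # a programme by its (start, stop) tuple.
    for el in lst:
        if el.tag == 'channel':
            yield el.get('id'), el
        elif el.tag == 'programme':
            yield (el.get('start'), el.get('stop')), el

def combine_element(one, other):
    one_els = elements(one)
    other_els = elements(other)

    # Start from other's elements, then let one's elements overwrite
    # any entry that has the same key.
    merged_els = dict(other_els)
    merged_els.update(one_els)

    result_els = []
    progressend = len(merged_els)
    # printProgressBar is the progress helper used in the question.
    for i, (_k, el) in enumerate(merged_els.items()):
        printProgressBar(
            i, progressend, prefix='Progress:', suffix='Complete', length=50)
        result_els.append(el)

    return result_els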
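
For comparison, here is a minimal sketch of a different technique from the dictionary merge above. It keeps the structure of the function in the question, appending to `one` in place, and only swaps the two lists for sets so that each membership test no longer scans a whole list. The name `programme_times` and the sample XML strings are made up for illustration, and the progress bar is left out as an assumption that it is not essential; it could be called every few thousand elements if it is still wanted.

```python
import xml.etree.ElementTree as ET


def combine_element(one, other):
    """Append to `one` each first-level element of `other` whose key is new."""
    # Sets give average constant-time membership tests,
    # whereas `x in some_list` scans the list.
    channel_ids = set()
    programme_times = set()

    # Record the keys that already exist in the first tree.
    for el in one:
        if el.tag == 'channel':
            channel_ids.add(el.get('id'))
        elif el.tag == 'programme':
            programme_times.add((el.get('start'), el.get('stop')))

    # Append only the elements whose key has not been seen yet.
    for el in other:
        if el.tag == 'channel':
            key = el.get('id')
            if key not in channel_ids:
                channel_ids.add(key)
                one.append(el)
        elif el.tag == 'programme':
            key = (el.get('start'), el.get('stop'))
            if key not in programme_times:
                programme_times.add(key)
                one.append(el)
    return one


# Illustrative data only; the ids and times are invented.
first = ET.fromstring(
    '<tv><channel id="C1"/><programme channel="C1" start="1" stop="2"/></tv>')
second = ET.fromstring(
    '<tv><channel id="C1"/><channel id="C3"/>'
    '<programme channel="C3" start="1" stop="2"/>'
    '<programme channel="C3" start="3" stop="4"/></tv>')

combine_element(first, second)
print([(el.tag, el.get('id') or el.get('start')) for el in first])
# [('channel', 'C1'), ('programme', '1'), ('channel', 'C3'), ('programme', '3')]
```

Unlike the dictionary version, this sketch leaves the existing children of `one` in their original order, appends the new ones after them, and does not collapse elements inside `one` that happen to share a key.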