Question

I have to compare the following two lists of dictionaries:

main = [{'id': 1,'rate': 13,'type': 'C'}, {'id': 2,'rate': 39,'type': 'A'}, ...]
compare = [{'id': 119, 'rate': 33, 'type': 'D'}, {'id': 120, 'rate': 94, 'type': 'A'}, ...]

for m in main:
  for c in compare:
     if (m['rate'] > c['rate']) and (m['type'] == c['type']):
          # ...

Each list contains about 9,000 items. The code above runs about 81,000,000 times (9,000 * 9,000). How can I speed it up?

Answer 1

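You could first sort or split the lists by type and then compare only within each type. The question then becomes how many operations the sorting or splitting costs, and how many comparisons remain afterwards. Keep in mind that there are quite efficient sorting algorithms.

The next optimization could be to sort by rate. That way, you can break out of the loop as soon as the condition m['rate'] > c['rate'] no longer holds. In fact, you could even use a divide and conquer algorithm.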
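As a minimal sketch of the early-exit idea, assuming compare can be sorted by rate in ascending order first. The function name pairs_with_early_exit and the sample data are made up for illustration:

```python
def pairs_with_early_exit(main, compare):
    """Yield (m, c) pairs with m['rate'] > c['rate'] and equal type."""
    # Assumption: sorting a copy of compare by rate is acceptable here.
    compare_sorted = sorted(compare, key=lambda c: c['rate'])
    for m in main:
        for c in compare_sorted:
            if c['rate'] >= m['rate']:
                # Every later c has an equal or higher rate, so stop early.
                break
            if m['type'] == c['type']:
                yield m, c


# Hypothetical sample data, not taken from the question
sample_main = [{'id': 1, 'rate': 13, 'type': 'C'},
               {'id': 2, 'rate': 39, 'type': 'A'}]
sample_compare = [{'id': 119, 'rate': 33, 'type': 'D'},
                  {'id': 120, 'rate': 20, 'type': 'A'}]

for m, c in pairs_with_early_exit(sample_main, sample_compare):
    print(m['id'], c['id'])  # 2 120
```

This only skips the part of compare that can never match. The type check still runs for every remaining pair, which is why splitting by type first pays off as well.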

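On top of that, you may benefit from the effect described in "Why is processing a sorted array faster than processing an unsorted array?". That is not an algorithmic improvement, but it can still make a big difference.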

Let me generate a data set with 9000 items (in the future, you may want to include something like this in your question, because it makes our lives easier):

import random
types = ["A", "B", "C", "D", "E", "F"]
main=[]
compare = []
for i in range(9000):
    main.append({'id':random.randint(0,20000), 'rate':random.random()*500, 'type':types[random.randint(0,5)]})
    compare.append({'id': random.randint(0, 20000), 'rate': random.random() * 500, 'type': types[random.randint(0, 5)]})

Running it with a loop similar to the one in the question:

import time
start = time.time()
cycles = 0
for m in main:
  for c in compare:
      cycles += 1
      if (m['rate'] > c['rate']) and (m['type'] == c['type']):
          pass
end = time.time()
print("Total number of cycles "+str(cycles))
print("Seconds taken: " + str(end - start))

(on my machine) results in 81 million cycles and takes about 30 seconds.

Splitting by type could look like this:

# Split by types
mainsplit = {}
compsplit = {}
for t in types:
    cycles += 1
    mainsplit[t] = []
    compsplit[t] = []
for m in main:
    cycles += 1
    mainsplit[m["type"]].append(m)
for c in compare:
    cycles += 1
    compsplit[c["type"]].append(c)

# Then go through it by type
for t in types:
    for m in mainsplit[t]:
        for c in compsplit[t]:
            cycles += 1
            if m['rate'] > c['rate']:
                pass

This gives ~14M cycles and takes only ~4 s.

Sorting the partial results by "rate" and finding a lower bound for the "rate":

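# Then go through it by type
for t in types:
    mainsplit[t].sort(key=lambda i: i["rate"])
    compsplit[t].sort(key=lambda i: i["rate"])
    # compsplit[t][:start_of_m_in_c] is already known to have lower rates than the current m
    start_of_m_in_c = 0
    for m in mainsplit[t]:
        for nc in range(start_of_m_in_c, len(compsplit[t])):
            cycles += 1
            if m["rate"] > compsplit[t][nc]["rate"]:
                pass
            else:
                # Remember the lower bound and stop: all later entries have higher rates
                start_of_m_in_c = nc
                break
        else:
            # Every remaining entry matched, so later m have nothing new to scan
            start_of_m_in_c = len(compsplit[t])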

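The number of cycles is now 36,000 (not counting the cycles used by the sorting algorithm) and the time is 30 ms.

All in all, that is a performance improvement by a factor of 1000.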
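To check that this kind of shortcut still gives the same answer as the original nested loop, here is a hedged, self-contained sketch. It assumes only the number of matching pairs is needed. The names count_matches and count_matches_naive are made up for illustration and are not part of the code above:

```python
from collections import defaultdict


def count_matches(main, compare):
    """Count pairs with m['rate'] > c['rate'] and m['type'] == c['type']."""
    main_rates = defaultdict(list)
    comp_rates = defaultdict(list)
    for m in main:
        main_rates[m['type']].append(m['rate'])
    for c in compare:
        comp_rates[c['type']].append(c['rate'])

    total = 0
    for t, m_rates in main_rates.items():
        if t not in comp_rates:
            continue
        m_rates.sort()
        c_rates = sorted(comp_rates[t])
        k = 0  # number of compare rates strictly below the current main rate
        for rate in m_rates:
            # k never moves backwards because m_rates is ascending
            while k < len(c_rates) and c_rates[k] < rate:
                k += 1
            total += k
    return total


def count_matches_naive(main, compare):
    """Reference implementation: the original nested loop."""
    return sum(1 for m in main for c in compare
               if m['rate'] > c['rate'] and m['type'] == c['type'])
```

On small random inputs, both functions should return the same count. Only the first one avoids the quadratic number of comparisons.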

Answer 2

Given:

main = [
    {'id': 1, 'rate': 13, 'type': 'C'},
    {'id': 2, 'rate': 39, 'type': 'A'},
    {'id': 3, 'rate': 94, 'type': 'A'},
    {'id': 4, 'rate': 95, 'type': 'A'},
    {'id': 5, 'rate': 96, 'type': 'A'}
]
compare = [
    {'id': 119, 'rate': 33, 'type': 'D'},
    {'id': 120, 'rate': 94, 'type': 'A'}
]

You can first map the two lists of dicts into two dicts of lists of dicts indexed by type, and then sort the sub-lists by rate:

mappings = []
for lst in main, compare:
    mappings.append({})
    for entry in lst:
        mappings[-1].setdefault(entry['type'], []).append(entry)
    for entries in mappings[-1].values():
        entries.sort(key=lambda entry: entry['rate'])
main, compare = mappings

so that main becomes:

{'C': [{'id': 1, 'rate': 13, 'type': 'C'}],
 'A': [{'id': 2, 'rate': 39, 'type': 'A'},
       {'id': 3, 'rate': 94, 'type': 'A'},
       {'id': 4, 'rate': 95, 'type': 'A'},
       {'id': 5, 'rate': 96, 'type': 'A'}]}

and compare becomes:

{'D': [{'id': 119, 'rate': 33, 'type': 'D'}],
 'A': [{'id': 120, 'rate': 94, 'type': 'A'}]}

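so that you can iterate over the matching types of the two dicts in linear time, and use bisect to find, in each sub-list of main, the index from which the rate is greater than that of the compare entry. Each lookup takes O(log n). You then iterate over the rest of the sub-list from that index for processing. Overall, this algorithm has O(n log n) time complexity (not counting the iteration over the matches themselves), an improvement over the O(n^2) time complexity of the original code: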

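from bisect import bisect

for type_ in main.keys() & compare.keys():
    main_entries = main[type_]
    # Build the sorted list of rates once per type so that each lookup stays O(log n)
    main_rates = [d['rate'] for d in main_entries]
    for entry in compare[type_]:
        # bisect returns the index of the first rate strictly greater than entry['rate']
        for match in main_entries[bisect(main_rates, entry['rate']):]:
            print(match['id'], entry['id'])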

This outputs:

4 120
5 120
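The entry with id 3 (rate 94) is missing from the output because the condition is strictly greater than. A small sketch of why bisect (an alias of bisect_right) is the right call here rather than bisect_left, using the rates of the type 'A' entries from the example:

```python
from bisect import bisect_left, bisect_right

rates = [39, 94, 95, 96]  # sorted rates of the type 'A' entries in main

# bisect_right: index of the first rate strictly greater than 94
print(bisect_right(rates, 94))  # 2

# bisect_left: index of the first rate >= 94, which would wrongly include the tie
print(bisect_left(rates, 94))   # 1
```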

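Demo: https://repl.it/repls/EasygoingReadyTechnologies

Disclaimer: this may look like an implementation of @ThomasWeller's solution, but I really had not seen his answer until I finished coding, which was interrupted by my other work. Also, @ThomasWeller wants to sort the two lists by type, which would incur O(n log n) time complexity, whereas the grouping can be done in linear time, as in the for entry in lst loop in my code.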

Answer 3

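This seems like a job for sqlite. This is exactly the kind of thing databases are optimized for. Python has very good bindings to sqlite, so it should be a good fit.

Here is a starting point...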

import sqlite3

c = None
try:
    c = sqlite3.connect(':memory:')
    c.execute('create table main ( id integer primary key, rate integer not null,   type text not null );')
    main = [{'id': 1,'rate': 13,'type': 'C'}, {'id': 2,'rate': 39,'type': 'A'}]
    for e in main:
        # use placeholders instead of building the SQL string by hand
        c.execute('insert into main (id, rate, type) VALUES (?, ?, ?)',
                    (e['id'], e['rate'], e['type']))
    # now for the query
    # exercise left for the OP (but does require some SQL expertise)
except sqlite3.Error as e:
    print(e)
finally:
    if c:
        c.close()
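One possible way to fill in the query that is left as an exercise, offered as a sketch rather than a tuned solution. It assumes a second table named compare with the same columns, and an index on (type, rate) that may or may not help at this data size. The sample rows are the ones from Answer 2:

```python
import sqlite3

main = [{'id': 1, 'rate': 13, 'type': 'C'}, {'id': 2, 'rate': 39, 'type': 'A'},
        {'id': 3, 'rate': 94, 'type': 'A'}, {'id': 4, 'rate': 95, 'type': 'A'},
        {'id': 5, 'rate': 96, 'type': 'A'}]
compare = [{'id': 119, 'rate': 33, 'type': 'D'}, {'id': 120, 'rate': 94, 'type': 'A'}]

conn = sqlite3.connect(':memory:')
try:
    for table, rows in (('main', main), ('compare', compare)):
        conn.execute(f'create table {table} (id integer primary key, '
                     'rate integer not null, type text not null)')
        # Named placeholders are filled from the keys of each dict.
        conn.executemany(
            f'insert into {table} (id, rate, type) values (:id, :rate, :type)', rows)
        # Assumption: an index on (type, rate) helps the join below.
        conn.execute(f'create index {table}_type_rate on {table} (type, rate)')

    # Same condition as the nested loop: equal type and a higher rate in main.
    matches = conn.execute(
        'select m.id, c.id from main as m '
        'join compare as c on m.type = c.type and m.rate > c.rate '
        'order by m.id, c.id').fetchall()
finally:
    conn.close()

print(matches)  # [(4, 120), (5, 120)]
```

The join expresses the same condition as the original if statement, so the database decides how to evaluate it instead of a hand-written loop.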

Answer 4

You can use the PyPy interpreter instead of the classic CPython. It can give you a speed-up of up to 80%.

How can I optimize this loop in Python?
