Multithreaded wordcount and global dictionary update in Python

Time: 2017-04-12 10:34:48

Tags: python thread-safety

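In the code below, the goal is to perform a word count. The add_counts function is called concurrently from multiple threads. Is this read-and-update operation thread-safe? This answer says that a dictionary update may be thread-safe, but what about a read followed by an update, as below: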

word_counts={}

@concurrent
def add_counts(line):
    for w in line.split():
        word_counts[w] = word_counts.get(w, 0) + 1

for line in somebigfile:
    add_counts(line)

1 Answer:

Answer 0 (score: 1)

A read followed by an update is not thread-safe. Here is an example you can run locally to see the effect in practice:

from threading import Thread


def add_to_counter(ctr):
    for i in range(100000):
        ctr['ctr'] = ctr.get('ctr', 0) + 1


ctr = {}

t1 = Thread(target=add_to_counter, args=(ctr,))
t2 = Thread(target=add_to_counter, args=(ctr,))

t1.start()
t2.start()
t1.join()
t2.join()

print(ctr['ctr'])

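The result obviously depends on scheduling and other system- and timing-related details, but on my system I consistently get varying numbers below 200000.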

Solution 1: Locks

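You could require the threads to acquire a lock every time before they modify the dictionary. This slows down program execution somewhat, because the threads can no longer perform their updates at the same time.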
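The locking approach could look like the following minimal sketch, which adapts the counter example above. The extra `lock` parameter and the list of threads are additions made for illustration; the key point is that the lock is held across the whole read-modify-write step, not just the write.

```python
from threading import Lock, Thread


def add_to_counter(ctr, lock):
    for i in range(100000):
        # Hold the lock while reading the old value and writing the new one,
        # so no other thread can interleave between the two steps.
        with lock:
            ctr['ctr'] = ctr.get('ctr', 0) + 1


ctr = {}
lock = Lock()  # one lock shared by every thread that touches ctr

threads = [Thread(target=add_to_counter, args=(ctr, lock)) for _ in range(2)]

for t in threads:
    t.start()
for t in threads:
    t.join()

print(ctr['ctr'])
```

With the lock in place, every increment is applied, so the total is 200000 on every run.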

Solution 2: Sum up the counters at the end

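Depending on your exact use case, you may be able to give each thread its own counter and sum the counts once the threads have finished. The dictionary-like collections.Counter lets you easily add two counters together (here is the example above, modified to use Counters):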

from collections import Counter
from threading import Thread


def add_to_counter(counter):
    for i in range(100000):
        counter['ctr'] = counter.get('ctr', 0) + 1


ctr1 = Counter()
ctr2 = Counter()

t1 = Thread(target=add_to_counter, args=(ctr1,))
t2 = Thread(target=add_to_counter, args=(ctr2,))

t1.start()
t2.start()
t1.join()
t2.join()

ctr = ctr1 + ctr2

print(ctr['ctr'])
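Applied to the word count from the question, the same idea might look like the sketch below. It is only an illustration under stated assumptions: the `lines` list stands in for the big file, and `count_words`, `n_workers`, and the round-robin chunking are hypothetical choices, not part of the original code. Each worker fills its own private Counter, and the partial results are merged only after all workers have finished.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(lines):
    # Each call builds its own Counter, so no state is shared between threads.
    local_counts = Counter()
    for line in lines:
        local_counts.update(line.split())
    return local_counts


# Hypothetical stand-in for the lines of the big file.
lines = ["a b a", "b c", "a c c", "d"]

# Hypothetical chunking: deal the lines out round-robin, one chunk per worker.
n_workers = 2
chunks = [lines[i::n_workers] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# Merge the per-thread counters in the main thread after all work is done.
word_counts = sum(partial_counts, Counter())

print(word_counts)
```

Because the merge happens in a single thread after the workers are done, no lock is needed.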