Pythonic way to compare a list of words against a list of sentences and print the matching lines

Time: 2018-11-21 20:13:15

Tags: python

I am currently cleaning up a database, and the process has become very time-consuming. The typical

for email in emails:

loop is nowhere near fast enough.

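For example, I am currently comparing a list of 230,000 emails against a full record list of 39,000,000 lines. Matching these emails to the record lines they belong to and printing them takes several hours. Does anyone know how to implement threading in this query to speed it up? Although this is very fast: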

strings = ("string1", "string2", "string3")
for line in file:
    if any(s in line for s in strings):
        print("yay!")

it never prints the matching line, only the needle.

Thanks in advance

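2 Answers:

Answer 0: (score: 2)

One possibility is to store the emails in a set. This makes the membership check (word in emails) O(1) on average, so the work done is proportional to the total number of words in the file: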

emails = {"string1", "string2", "string3"} # this is a set

for line in f:
    if any(word in emails for word in line.split()):
        print("yay!")

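Your original solution is O(nm) (for n words and m emails), as opposed to O(n) with a set.

Answer 1: (score: 1)

Here is a sample solution using threads. This code splits your data into equal chunks and passes each chunk as an argument to compare(), one chunk per thread, according to the number of threads we declare.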
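Since the question asks for the matching record lines rather than a fixed message, the set lookup can be adapted to return the lines themselves. The following is a minimal sketch, assuming each email appears as a whole whitespace-separated token in a record; the function name matching_lines and the sample data are hypothetical and not from the original post.

```python
def matching_lines(lines, emails):
    """Yield each line that contains at least one token found in `emails`."""
    email_set = set(emails)  # O(1) average-case membership checks
    for line in lines:
        # Assumption: an email is a whole whitespace-separated token in the line.
        if any(word in email_set for word in line.split()):
            yield line.rstrip("\n")


# Hypothetical sample data for illustration only.
sample_emails = ["a@example.com", "b@example.com"]
sample_records = [
    "1 a@example.com Alice\n",
    "2 c@example.com Carol\n",
    "3 b@example.com Bob\n",
]

for match in matching_lines(sample_records, sample_emails):
    print(match)
```

Because a file object iterates line by line, passing an open file as lines avoids loading every record into memory. Unlike the substring test (s in line), this approach does not match an email embedded in a longer token, for example one followed directly by a comma.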

strings = ("string1", "string2", "string3")
lines = ['some random', 'lines with string3', 'and without it',\
         '1234', 'string2', 'string1',\
         "string1", 'abcd', 'xyz']

def compare(x, thread_idx):
    print('Thread-{} started'.format(thread_idx))
    for line in x:
        if any(s in line for s in strings):
            print("We got one of strings in line: {}".format(line))
    print('Thread-{} finished'.format(thread_idx))

Threading part:

from threading import Thread

threads = []
threads_amount = 3
chunk_size = len(lines) // threads_amount

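for chunk in range(threads_amount):
    start = chunk * chunk_size
    # The last thread takes any leftover lines so none are skipped.
    end = len(lines) if chunk == threads_amount - 1 else start + chunk_size
    threads.append(Thread(target=compare, args=(lines[start:end], chunk + 1)))
    threads[-1].start()

# Wait for every thread to finish.
for thread in threads:
    thread.join()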

Output (the order of the lines may vary between runs):

Thread-1 started
Thread-2 started
Thread-3 started
We got one of strings in line: string2
We got one of strings in line: string1
We got one of strings in line: string1
We got one of strings in line: lines with string3
Thread-2 finished
Thread-3 finished
Thread-1 finished
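The same thread-per-chunk idea can also be written with the standard library's concurrent.futures.ThreadPoolExecutor, which starts and joins the threads itself and returns results in chunk order. This is only a sketch, assuming the lines fit in memory; the helper names split_into_chunks and find_matches and the extra sample line are hypothetical and not part of the answer above.

```python
from concurrent.futures import ThreadPoolExecutor

strings = ("string1", "string2", "string3")
lines = ['some random', 'lines with string3', 'and without it',
         '1234', 'string2', 'string1',
         'string1', 'abcd', 'xyz', 'extra string2']


def split_into_chunks(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def find_matches(chunk):
    """Return the lines in chunk that contain any of the search strings."""
    return [line for line in chunk if any(s in line for s in strings)]


with ThreadPoolExecutor(max_workers=3) as executor:
    # map() returns results in the same order as the input chunks.
    results = executor.map(find_matches, split_into_chunks(lines, 3))
    matches = [line for chunk_result in results for line in chunk_result]

for line in matches:
    print(line)
```

Returning the matches instead of printing inside each thread keeps the output in input order. In standard CPython the global interpreter lock means threads rarely speed up CPU-bound string matching, so it is worth timing this against the single-threaded set lookup before adopting it.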