Is there a way to make phrase counting over a Wikipedia dump faster (Python)?

Time: 2019-05-20 10:04:11

Tags: python regex wikipedia

I have downloaded a Wikipedia dump (enwiki), and I am trying to count how many times each phrase occurs in the whole dump. This is how I do it in Python:

import re
import subprocess
import sys

line_trans = str.maketrans('–’', "-'")
words_split_re = re.compile(r"[^\w\-']")

def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w)).search

list_of_words_I_care = get_target_full_names() # list of phrases, e.g. "John Brown", "Paula", "Mr. Hackeet"; each may be a single word or several words
ttl_apprx_counts = 0
doc_no = 0
targets = {word: 0 for word in set(list_of_words_I_care)} # set each phrase's count to 0 initially
for fn in sys.argv[1:]: # go through all *.bz2 files
    sys.stderr.write("Processing %s\n" % fn)
    with subprocess.Popen(
        "bzcat %s | wikiextractor/WikiExtractor.py --no_templates -o - -" % fn,
        stdout=subprocess.PIPE,
        shell=True
    ) as proc:
        while True:
            line = proc.stdout.readline() # read one line of an article
            if not line:
                break
            if line.startswith(b'<'):
                doc_no += 1
                continue
            line = line.decode('utf-8')
            line = line.translate(line_trans)
            ttl_apprx_counts += sum(1 for w in words_split_re.split(line) if w) # len(filter(...)) raises TypeError in Python 3
            for target_word in targets:
                _res = findWholeWord(target_word)(line)
                if _res is not None:
                    targets[target_word] += 1

I think it works, but it is really slow. Is there a way to make this code faster?
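One likely bottleneck is that the loop compiles a fresh regex for every phrase on every line. A common alternative is to compile a single alternation over all phrases once and scan each line with it. A minimal sketch with stdlib only, using illustrative phrases and an illustrative sample line (not the actual dump data):

```python
import re

# Hypothetical stand-ins for the output of get_target_full_names().
phrases = ["John Brown", "Paula", "Mr. Hackeet"]

# Compile ONE alternation pattern instead of one regex per phrase per line.
# re.escape protects characters like the '.' in "Mr. Hackeet"; sorting longest-first
# keeps a shorter phrase from shadowing a longer one that starts the same way.
pattern = re.compile(
    r'\b(' + '|'.join(re.escape(p) for p in sorted(phrases, key=len, reverse=True)) + r')\b'
)

counts = {p: 0 for p in phrases}
line = "Yesterday John Brown met Paula; Mr. Hackeet was absent."
for match in pattern.finditer(line):
    counts[match.group(1)] += 1
```

Note one behavioral difference: `finditer` counts every occurrence on a line, while the original `search`-based loop increments a phrase's count at most once per line, so the totals may differ if a phrase repeats within a line.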

I run the code like this: python get_freqs.py dumps.wikimedia.org/enwiki/20190320/*.bz2

0 Answers:

There are no answers.