Efficient way to replace strings from a vocabulary - Python

Time: 2017-12-06 13:10:11

Tags: python regex multithreading python-multithreading

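I have a vocabulary of phrases, and I want to replace occurrences of those phrases in other files. For example, I have the following vocabulary:

United States, New York

I want to transform the following document:

"I work in New York but I don't even live in the United States"

into this:

"I work in New_York but I don't even live in the United_States"

Currently I am doing it like this: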

import os

def _check_files_and_write_phrases(docs, worker_num):
    print("worker ", worker_num," started!")
    for i, file in enumerate(docs):
        file_path = DOCS_FOLDER + file
        with open(file_path) as f:
            text = f.read()
            for phrase in phrases:
                text = text.replace(phrase, phrase.replace(' ','_'))
            new_doc = PHRASES_DOCS_FOLDER + file[:-4] + '_phrases.txt'
            with open(new_doc, 'w') as nf:
                nf.write(text)

    print("job done on worker ", worker_num)


docs = os.listdir(DOCS_FOLDER)

import threading

threads = []
for i in range(1, 11):
    print(i)
    start = int((len(docs)/10) * (i - 1))
    end = int((len(docs)/10) * (i))
    print(start,end)
    if i != 10:
        t = threading.Thread(target=_check_files_and_write_phrases, args=(docs[start:end], i, ))
    else:
        t = threading.Thread(target=_check_files_and_write_phrases, args=(docs[start:], i, ))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

print("all workers finished!")

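But this is too slow! I thought threads would do the job, but I was wrong...

Is there a more efficient way?

2 Answers:

Answer 0: (score: 1)

All of the phrases can be replaced with a single re.sub() call, and the pattern can be precompiled to speed things up further: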

import re

phrases = {"United States":"United_States", "New York":"New_York"}
re_replace = re.compile(r'\b({})\b'.format('|'.join(re.escape(phrase) for phrase in phrases.keys())))

def _check_files_and_write_phrases(docs, worker_num):
    print("worker {} started!".format(worker_num))

    for i, filename in enumerate(docs):
        file_path = DOCS_FOLDER + filename

        with open(file_path) as f:
            text = f.read()
            text = re_replace.sub(lambda x: phrases[x.group(1)], text)
            new_doc = PHRASES_DOCS_FOLDER + filename[:-4] + '_phrases.txt'

            with open(new_doc, 'w') as nf:
                nf.write(text)

    print("job done on worker {}".format(worker_num))

This first builds a regular expression from the phrases dictionary, which looks like this:

\b(United\ States|New\ York)\b

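The sub() call then uses the phrases dictionary to look up the required replacement for each match. It takes two arguments here: the replacement and the original text. The replacement can be a fixed string or, as in this case, a function. The function receives a match object as its single argument and returns the replacement text. A lambda is used for this; it simply looks up the matched text in the phrases dictionary.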
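As a small self-contained illustration of the function-as-replacement behaviour described above, the sketch below applies the same pattern to the sentence from the question, without any file handling. The helper name replace_phrases is an assumption made for this example and is not part of the answer.

```python
import re

phrases = {"United States": "United_States", "New York": "New_York"}

# Build one alternation pattern from the escaped dictionary keys.
re_replace = re.compile(
    r'\b({})\b'.format('|'.join(re.escape(phrase) for phrase in phrases))
)


def replace_phrases(text):
    """Replace every known phrase in a single pass over the text.

    Hypothetical helper for illustration only.
    """
    # The callable receives a match object and returns the replacement string.
    return re_replace.sub(lambda match: phrases[match.group(1)], text)


result = replace_phrases(
    "I work in New York but I don't even live in the United States"
)
print(result)
```

Because the whole vocabulary is folded into one pattern, the text is scanned once rather than once per phrase, which is where the speed-up over the replace() loop is expected to come from.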

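Instead of doing a dictionary lookup, the replacement could be computed with replace() here, but precomputed replacement text should be faster. The \b is added so that replacements only happen at word boundaries, so MYNew York, for example, would be skipped. If needed, flags=re.I can be passed to re.compile() to make the search case-insensitive.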
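A rough sketch of the two points above, the word boundary and the case-insensitive flag. One detail to watch: with re.I the matched text keeps whatever casing appears in the document, so a plain lookup in the phrases dictionary could raise a KeyError. The lower-cased lookup table and the name replace_ignoring_case below are assumptions added for this example; they are one possible way to handle that, not something the answer prescribes.

```python
import re

phrases = {"United States": "United_States", "New York": "New_York"}

# Assumption for this sketch: a lower-cased copy of the keys, so that a
# case-insensitive match can still be looked up.
lowered = {key.lower(): value for key, value in phrases.items()}

pattern = re.compile(
    r'\b({})\b'.format('|'.join(re.escape(key) for key in phrases)),
    flags=re.I,
)


def replace_ignoring_case(text):
    """Hypothetical helper: replace phrases regardless of their casing."""
    # match.group(1) keeps the casing found in the text, so normalise it
    # before looking it up.
    return pattern.sub(lambda match: lowered[match.group(1).lower()], text)


# No word boundary between "Y" and "N", so nothing is replaced here.
boundary_case = replace_ignoring_case("MYNew York")

# Lower-case input still matches and gets the dictionary's replacement text.
mixed_case = replace_ignoring_case("i live in new york")
```

Note that this variant always writes the replacement with the casing stored in the dictionary, which may or may not be what is wanted.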

Answer 1: (score: 0)

Try changing the loop over the phrases so that it only replaces phrases that are actually present in the text.
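One possible reading of this suggestion, as a minimal sketch: check for membership before calling replace(). The function name replace_present_phrases and the idea of passing the phrases in as a list are assumptions for this example; whether the extra check actually saves time depends on the data and would need to be measured.

```python
phrases = ["United States", "New York"]


def replace_present_phrases(text, phrases):
    """Hypothetical helper: only call replace() for phrases found in text."""
    for phrase in phrases:
        # Skip the replace() call when the phrase does not occur at all.
        if phrase in text:
            text = text.replace(phrase, phrase.replace(' ', '_'))
    return text


example = replace_present_phrases("I work in New York", phrases)
```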

Try it both with and without threads.