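I have a vocabulary of multi-word phrases, and I want to replace every occurrence of those phrases in other files. For example, given this vocabulary:
United States, New York
I want to turn the following text:
"I work in New York, but I don't even live in the United States"
into this:
"I work in New_York, but I don't even live in the United_States"
Currently I am doing it like this: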
    import os
    import threading

    # DOCS_FOLDER, PHRASES_DOCS_FOLDER and phrases are defined elsewhere.

    def _check_files_and_write_phrases(docs, worker_num):
        print("worker ", worker_num, " started!")
        for i, file in enumerate(docs):
            file_path = DOCS_FOLDER + file
            with open(file_path) as f:
                text = f.read()
            # Replace each phrase with its underscore-joined form, one phrase at a time.
            for phrase in phrases:
                text = text.replace(phrase, phrase.replace(' ', '_'))
            new_doc = PHRASES_DOCS_FOLDER + file[:-4] + '_phrases.txt'
            with open(new_doc, 'w') as nf:
                nf.write(text)
        print("job done on worker ", worker_num)

    docs = os.listdir(DOCS_FOLDER)

    threads = []
    # Split the file list into 10 slices, one per thread.
    for i in range(1, 11):
        print(i)
        start = int((len(docs) / 10) * (i - 1))
        end = int((len(docs) / 10) * i)
        print(start, end)
        if i != 10:
            t = threading.Thread(target=_check_files_and_write_phrases, args=(docs[start:end], i))
        else:
            # The last thread takes whatever remains.
            t = threading.Thread(target=_check_files_and_write_phrases, args=(docs[start:], i))
        threads.append(t)
        t.start()

    for t in threads:
        t.join()
    print("all workers finished!")
But this is too slow! I thought threads would do the job, but I was wrong...
Is there a more efficient way?
Answer 0: (score: 1)
All phrases can be replaced with a single re.sub() call, and the regular expression can be precompiled to speed things up further:
    import re

    phrases = {"United States": "United_States", "New York": "New_York"}
    # Build one alternation from all phrases and compile it once.
    re_replace = re.compile(r'\b({})\b'.format('|'.join(re.escape(phrase) for phrase in phrases.keys())))

    def _check_files_and_write_phrases(docs, worker_num):
        print("worker {} started!".format(worker_num))
        for i, filename in enumerate(docs):
            file_path = DOCS_FOLDER + filename
            with open(file_path) as f:
                text = f.read()
            # A single pass over the text replaces every phrase.
            text = re_replace.sub(lambda x: phrases[x.group(1)], text)
            new_doc = PHRASES_DOCS_FOLDER + filename[:-4] + '_phrases.txt'
            with open(new_doc, 'w') as nf:
                nf.write(text)
        print("job done on worker {}".format(worker_num))
A regular expression is first built from the phrases dictionary. It looks like this:
    \b(United\ States|New\ York)\b
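The re.sub() call then uses the phrases dictionary to look up the replacement for each match. On a compiled pattern, sub() takes two arguments: the replacement and the original text. The replacement can be a fixed string or, as in this case, a function. The function receives a match object as its single argument and returns the replacement text. A lambda is used for this; it simply looks up the matched text, x.group(1), in the phrases dictionary.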
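To make the mechanics concrete, here is a minimal, self-contained sketch of the same idea applied to the sentence from the question. The helper name replace_phrases is only for illustration and is not part of the answer's code.

```python
import re

# Illustrative phrase table: original phrase -> replacement text.
phrases = {"United States": "United_States", "New York": "New_York"}

# One alternation of all escaped phrases, wrapped in word boundaries.
pattern = re.compile(r"\b({})\b".format("|".join(re.escape(p) for p in phrases)))

def replace_phrases(text):
    # The callable receives a match object and returns the replacement text.
    return pattern.sub(lambda m: phrases[m.group(1)], text)

print(replace_phrases("I work in New York, but I don't even live in the United States"))
# prints: I work in New_York, but I don't even live in the United_States
```

The text is scanned once regardless of how many phrases are in the table, whereas the loop in the question scans it once per phrase.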
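Instead of the dictionary lookup, the lambda could call replace() on the matched text (for example x.group(1).replace(' ', '_')), but looking up precomputed replacement text should be faster. The \b anchors restrict replacement to word boundaries, so a string such as MYNew York is skipped. If needed, flags=re.I can be passed to re.compile() to make the search case-insensitive. Note that the dictionary lookup is still case-sensitive, so in that case the keys and the matched text should be normalized to the same case; otherwise a match such as new york raises a KeyError.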
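A possible case-insensitive variant is sketched below. The lowercased lookup table and the longest-first ordering are additions of this sketch, not part of the answer above; the names lowered, pattern_ci and replace_phrases_ci are made up for illustration.

```python
import re

# Illustrative phrase table: original phrase -> replacement text.
phrases = {"United States": "United_States", "New York": "New_York"}

# Lowercased keys, so a match in any casing can be looked up.
lowered = {key.lower(): value for key, value in phrases.items()}

# Longest phrases first, so a longer phrase is preferred over a shorter
# phrase that shares the same prefix.
alternation = "|".join(
    re.escape(p) for p in sorted(phrases, key=len, reverse=True)
)
pattern_ci = re.compile(r"\b({})\b".format(alternation), flags=re.I)

def replace_phrases_ci(text):
    # Normalize the matched text before the dictionary lookup.
    return pattern_ci.sub(lambda m: lowered[m.group(1).lower()], text)

print(replace_phrases_ci("new york and NEW YORK"))
# prints: New_York and New_York
```

Every match is rewritten to the casing stored in the table. If the original casing should be preserved instead, the lambda could return m.group(1).replace(' ', '_').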
Answer 1: (score: 0)
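Try changing the {{1}} loop so that it only replaces phrases that are actually present in the text:
{{1}}
Try it both with and without threads.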
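The code for this answer did not survive in the thread. One possible reading of the suggestion, assuming phrases is an iterable of space-separated phrases as in the question, is sketched below; the function name replace_present_phrases is hypothetical.

```python
def replace_present_phrases(text, phrases):
    # `phrases` is assumed to be an iterable of space-separated phrases.
    for phrase in phrases:
        # Only call replace() when the phrase actually occurs in the text.
        if phrase in text:
            text = text.replace(phrase, phrase.replace(" ", "_"))
    return text

print(replace_present_phrases("I work in New York", ["United States", "New York"]))
# prints: I work in New_York
```

This still scans the text once per phrase, so any gain over the original loop is worth measuring rather than assuming.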