Question

I have two files, and I want to get every line from file2 (fsearch) that contains any given line from file1 (forig). I wrote a simple Python script that looks like this:

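import re

matches = []

def search_string(w, text):
    # Non-capturing groups make findall return whole matching lines
    # instead of tuples of the two groups; re.escape matches the word literally.
    reg = re.compile(r'(?:^|^.*\|)' + re.escape(w.strip("\r\n")) + r'(?:\t|\|).*$', re.M)
    matches.extend(reg.findall(text))

# forig and fsearch are the already opened file1 and file2
fsearch_text = fsearch.read()
for fword in forig:
    search_string(fword, fsearch_text)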

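There are about 100,000 lines in file1 and about 200,000 lines in file2, so my script takes roughly 6 hours to complete.
Is there a better algorithm that achieves the same result in less time?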

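Edit: I should have given an example of why I need a regexp:
I am searching for a list of words from file1 and trying to match them to translations in file2. If I don't use a regexp to restrict the possible matches, I would also match translations of words that merely contain the word I search for, for example:
Word I search for: 浸し
Matching word: お浸し|御浸し|御したし &n boiled greens in bonito-flavoured soy sauce (vegetable side dish)
So I have to restrict the start of the match to ^ or |, and the end of the match to \t or |, while still capturing the whole line.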
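
To make the boundary rule concrete, here is a minimal sketch of how such a pattern behaves. The sample lines and the helper name build_pattern are hypothetical, and the non-capturing groups and re.escape are assumptions added for the illustration, not part of the script above.

```python
import re

def build_pattern(word):
    # The word must start at the beginning of the line or right after a '|',
    # and must be followed by a tab or another '|'. re.escape guards against
    # regex metacharacters in the word (an added assumption).
    return re.compile(r'(?:^|^.*\|)' + re.escape(word) + r'(?:\t|\|).*$', re.M)

# Hypothetical sample data in a "headword|headword<TAB>gloss" layout.
sample = "お浸し|御浸し\tboiled greens\n浸し\tsoaking\n"

# Collect the full text of every matching line.
matches = [m.group(0) for m in build_pattern("浸し").finditer(sample)]
```

With this sample, the first line is skipped because 浸し appears there only as part of a longer headword, while the second line is returned whole.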

Answer 1

Assuming you can hold both files in memory at the same time, you can read them and sort them.

After that, you can compare the lines linearly.

# Read both files and sort their lines (assumes both fit in memory).
with open('e:\\temp\\file1.txt') as f1:
    lines1 = sorted(f1)

with open('e:\\temp\\file2.txt') as f2:
    lines2 = sorted(f2)

i1 = 0
i2 = 0
matchCount = 0
# Walk both sorted lists in step, advancing whichever side is smaller.
while i1 < len(lines1) and i2 < len(lines2):
    line1 = lines1[i1]
    line2 = lines2[i2]
    if line1 < line2:
        i1 += 1
    elif line1 > line2:
        i2 += 1
    else:
        # Identical lines: count the match and move on in file2.
        matchCount += 1
        i2 += 1

print('matchCount')
print(matchCount)
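
Note that the merge above counts lines that are identical in both files. If the lines of file2 look like the example in the question (headwords separated by | before a tab, then the translation), one possible alternative is a dictionary lookup instead of a regexp per word. This is a sketch under that assumption, not the method from the answer above; build_index and find_matches are hypothetical names, and it only considers headwords before the first tab.

```python
from collections import defaultdict

def build_index(search_lines):
    """Map each headword (before the first tab, split on '|') to its lines."""
    index = defaultdict(list)
    for line in search_lines:
        line = line.rstrip("\r\n")
        headwords = line.split("\t", 1)[0].split("|")
        # set() avoids listing a line twice if a headword is repeated in it.
        for key in set(headwords):
            index[key].append(line)
    return index

def find_matches(words, index):
    """Return every indexed line whose headwords include one of the words."""
    matches = []
    for word in words:
        matches.extend(index.get(word.strip("\r\n"), []))
    return matches
```

Usage would be along the lines of find_matches(forig, build_index(fsearch)). Each word then costs one dictionary lookup on average instead of a scan over the whole text of file2.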

Python cross-file search using regexp
