Question

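I am looking for an efficient script that searches "m" lines (stored in "text.txt") for "n" regular expression patterns (stored in "find_these.txt") and lists every match found.

The file "text.txt" contains the corpus.

The file "find_these.txt" contains the regular expression patterns to search for.

The file "output.txt" should contain all matches (the matches of every regular expression pattern in every line of the corpus).

The pseudocode is as follows: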

outputFile = open('output.txt', 'w')
for pattern in "find_these.txt"
    for line in "text.txt"
        if found then write "pattern ---> line" into outputFile
close all files

It would be very helpful if I could also write the line number along with the line. For example:

outputFile = open('output.txt', 'w')
for pattern in "find_these.txt"
    for line in "text.txt"
        if found then write "pattern ---> lineNumber ---> line" into outputFile
close all files

Can anyone help me, or point me to a workable solution? Many thanks.

Answer 1

Try this:

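import re

# Read one regular expression per line, dropping only the line terminator
# so that whitespace inside a pattern is preserved.
with open("find_these.txt") as f:
    regexes = [x.rstrip("\n") for x in f]

# Read the whole corpus into memory so it can be scanned once per pattern.
with open("text.txt") as f:
    text = [x.rstrip("\n") for x in f]

with open("output.txt", "w") as f:
    for pattern in regexes:
        compiled = re.compile(pattern)
        # Line numbers start at 1.
        for index, line in enumerate(text, start=1):
            if compiled.search(line):
                f.write("%s ---> %s ---> %s\n" % (pattern, index, line))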

Note that this may not be the most efficient approach, but for relatively small data sets it should solve your problem.
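If the corpus is too large to hold in memory, one option is to swap the loops and stream the lines instead. The following is only a sketch of that idea: the function name `find_matches` is made up for illustration, and the results come out grouped by line rather than by pattern. The `re` module already caches recently compiled patterns, so the main gain is in memory use rather than in compilation time.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def find_matches(patterns: List[str], lines: Iterable[str]) -> Iterator[Tuple[str, int, str]]:
    """Yield (pattern, line_number, line) for every pattern that matches a line.

    find_matches is a hypothetical helper, not part of any library.
    """
    # Compile each pattern once, keeping the source text for the report.
    compiled = [(p, re.compile(p)) for p in patterns]
    # Iterate lazily so only one corpus line is held at a time.
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        for source, regex in compiled:
            if regex.search(line):
                yield source, line_number, line
```

With real files, `lines` can be the open file object for "text.txt", and each yielded tuple can be written as "%s ---> %s ---> %s\n" exactly as in the code above.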

Answer 2

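Because of the order of the loops, I believe the existing answer will cause some problems, so I thought I would write up a quick solution.

Important things to note:

Reading the patterns from a file like this may cause some problems; I am not entirely sure. It is late, though, so I will update this tomorrow after some sleep.
This solution keeps all patterns in memory for the whole run of the program, which may or may not be acceptable depending on the number of patterns.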
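On the first point, two things that can go wrong when patterns come from a file are blank lines (an empty pattern matches everywhere) and lines that are not valid regular expressions. A pattern line such as `hi\n` is read as the characters h, i, backslash, n, and `re` itself interprets the `\n` escape as a newline, so escapes do not need extra handling. Below is a rough sketch of a loader that checks for the first two cases; `load_patterns` and its return shape are assumptions made for illustration, not something from the code in this answer.

```python
import re


def load_patterns(lines, flags=0):
    """Compile one pattern per input line (hypothetical helper).

    Returns (compiled, rejected), where rejected holds
    (line_number, pattern_text, error_message) tuples.
    """
    compiled, rejected = [], []
    for line_num, raw in enumerate(lines, start=1):
        source = raw.rstrip("\n")
        if not source:
            # Skip blank lines: an empty pattern would match every line.
            continue
        try:
            compiled.append(re.compile(source, flags))
        except re.error as exc:
            # Record the bad pattern instead of aborting the whole run.
            rejected.append((line_num, source, str(exc)))
    return compiled, rejected
```

Each compiled object keeps its source text in `.pattern`, so it can still be written to the output next to the match.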

find_these.txt:

hello
hi\n
\bbye\b

text.txt:

Hello there. Goodbye.
A line. Bye line! Another line
which ends here. Hi

Code:

import csv
import re

with open("../resources/find_these.txt") as patt_file:
    re_patts = [line.rstrip("\n") for line in patt_file]

out_headers = ["line_num", "line_text", "patt", "match_start", "match_end"]

with open("../resources/text.txt") as corp_file, open("../out/output.txt", "w", newline="") as out_file:
    out_writer = csv.writer(out_file)
    out_writer.writerow(out_headers)

    for line_num, corp_line in enumerate(corp_file, start=1):
        for curr_patt in re_patts:
            for curr_match in re.finditer(curr_patt, corp_line, re.IGNORECASE):
                out_writer.writerow((line_num, corp_line, curr_patt, curr_match.start(), curr_match.end()))

output.txt:

line_num,line_text,patt,match_start,match_end
1,"Hello there. Goodbye.
",hello,0,5
2,"A line. Bye line! Another line
",\bbye\b,8,11
3,"which ends here. Hi
",hi\n,17,20

Let me know if anything is unclear :)
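One thing to be aware of in output.txt: each line_text value keeps the trailing newline of the corpus line, so a single record spans two physical lines in the file. The `csv` module handles this when reading the file back, as long as the file is opened with newline="". A small sketch of reading the results, using an in-memory sample instead of the real file; `read_matches` is a made-up name for illustration:

```python
import csv
import io


def read_matches(stream):
    """Parse the CSV produced above into dicts (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(stream):
        rows.append({
            "line_num": int(row["line_num"]),
            # Drop the newline that was stored with the corpus line.
            "line_text": row["line_text"].rstrip("\n"),
            "patt": row["patt"],
            "span": (int(row["match_start"]), int(row["match_end"])),
        })
    return rows


# In-memory stand-in for open("../out/output.txt", newline="").
sample = io.StringIO(
    'line_num,line_text,patt,match_start,match_end\r\n'
    '1,"Hello there. Goodbye.\n",hello,0,5\r\n'
    '2,"A line. Bye line! Another line\n",\\bbye\\b,8,11\r\n',
    newline="",
)
matches = read_matches(sample)
```

If the embedded newlines are unwanted, writing `corp_line.rstrip("\n")` instead of `corp_line` in the code above avoids them at the source.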

Finding multiple regular expressions (stored in a file) line by line in a text (stored in another file)
