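I am looking for an efficient script that searches "m" lines (stored in "text.txt") for "n" regex patterns (stored in "find_these.txt") and lists every match it finds.
File "text.txt"
contains the corpus
File "find_these.txt"
contains the regex patterns to search for
File "output.txt"
contains all matches (the matches of every regex pattern in every line of the corpus)
The pseudocode is as follows: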
outputFile = open('output.txt', 'w')
for pattern in "find_these.txt"
    for line in "text.txt"
        if found then write "pattern ---> line" into outputFile
close all files
It would be very helpful if I could also write the line number along with the line. For example:
outputFile = open('output.txt', 'w')
for pattern in "find_these.txt"
    for line in "text.txt"
        if found then write "pattern ---> lineNumber ---> line" into outputFile
close all files
Can anyone help me, or point me to a workable solution? Many thanks.
Answer 0 (score: -1)
Try this:
import re

# Read one regex pattern per line.
with open("find_these.txt") as f:
    regexes = [x.strip() for x in f]

# Read the whole corpus into memory, one entry per line.
with open("text.txt") as f:
    text = [x.strip() for x in f]

with open("output.txt", "w") as f:
    for pattern in regexes:
        for index, line in enumerate(text):
            # enumerate() counts from 0, so add 1 for a human-readable line number.
            if re.search(pattern, line):
                f.write("%s ---> %s ---> %s\n" % (pattern, index + 1, line))
Note that this may not be the most efficient approach, but for relatively small data sets it should solve your problem.
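If efficiency matters more, one possible refinement is sketched below. It is only a sketch under some assumptions: the helper names `find_matches` and `write_matches` are made up for illustration, each pattern is compiled once with `re.compile`, and the corpus is streamed line by line instead of being loaded into a list. Streaming means the loops are swapped, so the output is grouped by line rather than by pattern. The `re` module already caches recently used patterns, so the gain from compiling up front may be modest.

```python
import re


def find_matches(patterns, lines):
    """Yield (pattern, line_number, line) for every line that a pattern matches."""
    # Compile each pattern once, keeping the original string for the report.
    compiled = [(p, re.compile(p)) for p in patterns]
    for line_number, line in enumerate(lines, start=1):
        for pattern, regex in compiled:
            if regex.search(line):
                yield pattern, line_number, line


def write_matches(pattern_path, corpus_path, output_path):
    """Hypothetical wrapper: read patterns, stream the corpus, write the report."""
    with open(pattern_path) as f:
        # Skip blank lines; an empty pattern would match every line.
        patterns = [x.rstrip("\n") for x in f if x.strip()]
    with open(corpus_path) as corpus, open(output_path, "w") as out:
        stripped = (line.rstrip("\n") for line in corpus)
        for pattern, line_number, line in find_matches(patterns, stripped):
            out.write("%s ---> %s ---> %s\n" % (pattern, line_number, line))
```

Calling `write_matches("find_these.txt", "text.txt", "output.txt")` would produce the same "pattern ---> lineNumber ---> line" format as the code above, only in a different order.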
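Answer 1 (score: -4)
Because of the order of the loops, I believe the existing answer will cause some problems, so I thought I would write a quick solution.
Important things to note: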
find_these.txt:
hello
hi\n
\bbye\b
text.txt:
Hello there. Goodbye.
A line. Bye line! Another line
which ends here. Hi
Code:
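import csv
import re

# Read one pattern per line; remove only the newline so the pattern text is kept as written.
with open("../resources/find_these.txt") as patt_file:
    re_patts = [line.rstrip("\n") for line in patt_file]

out_headers = ["line_num", "line_text", "patt", "match_start", "match_end"]

# newline="" lets the csv module control line endings in the output file.
with open("../resources/text.txt") as corp_file, open("../out/output.txt", "w", newline="") as out_file:
    out_writer = csv.writer(out_file)
    out_writer.writerow(out_headers)
    # Stream the corpus once; each line keeps its trailing newline, so a pattern such as hi\n can match.
    for line_num, corp_line in enumerate(corp_file, start=1):
        for curr_patt in re_patts:
            # finditer reports every match in the line, not just the first one.
            for curr_match in re.finditer(curr_patt, corp_line, re.IGNORECASE):
                out_writer.writerow((line_num, corp_line, curr_patt, curr_match.start(), curr_match.end()))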
output.txt:
line_num,line_text,patt,match_start,match_end
1,"Hello there. Goodbye.
",hello,0,5
2,"A line. Bye line! Another line
",\bbye\b,8,11
3,"which ends here. Hi
",hi\n,17,20
Let me know if anything is unclear :)
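Because each corpus line is written with its trailing newline, every `line_text` field in output.txt spans two physical lines. A CSV reader handles this, since the field is quoted. The sketch below shows one way to read such output back; `read_matches` is a hypothetical helper, and the sample string reproduces the first two rows of the output shown above instead of opening a real file.

```python
import csv
import io


def read_matches(csv_file):
    """Return one dict per match row, with the numeric columns converted to int."""
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append({
            "line_num": int(row["line_num"]),
            # Drop the newline that was stored as part of the corpus line.
            "line_text": row["line_text"].rstrip("\n"),
            "patt": row["patt"],
            "match_start": int(row["match_start"]),
            "match_end": int(row["match_end"]),
        })
    return rows


# In-memory stand-in for output.txt (first two data rows only).
sample = io.StringIO(
    'line_num,line_text,patt,match_start,match_end\n'
    '1,"Hello there. Goodbye.\n",hello,0,5\n'
    '2,"A line. Bye line! Another line\n",\\bbye\\b,8,11\n'
)

for match in read_matches(sample):
    # Slice the matched text back out of the line using the stored offsets.
    matched = match["line_text"][match["match_start"]:match["match_end"]]
    print(match["line_num"], match["patt"], matched)
```

With a real file, opening it as `open("../out/output.txt", newline="")` and passing the file object to `read_matches` should work the same way.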