Searching text (stored in one file) line by line for multiple regular expressions (stored in another file)

Time: 2020-04-26 00:48:41

Tags: python

I am looking for an efficient script that searches "m" lines (stored in "text.txt") for "n" regex patterns (stored in "find_these.txt") and lists every match found for every pattern.

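The file "text.txt" contains the corpus.

The file "find_these.txt" contains the regex patterns to search for, one per line.

The file "output.txt" should contain all matches (every regex pattern matched against every line of the corpus).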

The pseudocode looks like this:

outputFile = open('output.txt', 'w')
for pattern in "find_these.txt"
    for line in "text.txt"
        if found then write "pattern ---> line" into outputFile
close all files

It would be very helpful if I could write out both the line number and the line itself, like this:

outputFile = open('output.txt', 'w')
for pattern in "find_these.txt"
    for line in "text.txt"
        if found then write "pattern ---> lineNumber ---> line" into outputFile
close all files

Can someone help me, or point me to a workable solution? Thanks a lot.

2 answers:

Answer 0: (score: -1)

Try this:

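import re

# Read the patterns, one regex per line
with open("find_these.txt") as f:
    regexes = [x.strip() for x in f]

# Read the whole corpus into memory, one entry per line
with open("text.txt") as f:
    text = [x.strip() for x in f]
print(text)  # debug output: show the loaded lines

with open("output.txt", "w") as f:
    for pattern in regexes:
        # enumerate is 0-based, so add 1 to get a human-readable line number
        for index, line in enumerate(text):
            if re.search(pattern, line):
                f.write("%s ---> %s ---> %s\n" % (pattern, index + 1, line))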

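Note that this is probably not the most efficient approach, since it loads the whole corpus into memory and rescans it once per pattern, but for a relatively small data set it should solve your problem.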
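If efficiency does become a concern, one possible variation is sketched below: compile each pattern once and stream the corpus line by line instead of loading it all. The function names find_matches and write_matches are made up for this illustration and are not part of the answer above. Note that the output then comes out ordered by line rather than by pattern, and that re already caches recently used patterns, so the gain from compiling up front may be small when there are only a few patterns.

```python
import re


def find_matches(pattern_lines, text_lines):
    """Yield (pattern, line_number, line) for each pattern found in each line."""
    # Compile every non-empty pattern once, before touching the corpus.
    compiled = [(p, re.compile(p)) for p in (s.strip() for s in pattern_lines) if p]
    # Iterate lazily so the whole corpus never has to sit in memory.
    for line_number, line in enumerate(text_lines, start=1):
        line = line.strip()
        for pattern, regex in compiled:
            if regex.search(line):
                yield pattern, line_number, line


def write_matches(pattern_path, text_path, output_path):
    """Write 'pattern ---> line number ---> line' records to output_path."""
    with open(pattern_path) as pf, open(text_path) as tf, open(output_path, "w") as out:
        for pattern, line_number, line in find_matches(pf, tf):
            out.write("%s ---> %s ---> %s\n" % (pattern, line_number, line))
```

With the file names from the question, this would be called as write_matches("find_these.txt", "text.txt", "output.txt").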

Answer 1: (score: -4)

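Because of the order of its loops, I believe the existing answer will cause some problems, so I thought I would write up a quick solution.

Important things to note: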

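  • Reading patterns from a file like this may cause some problems; I am not entirely confident about it. It is late, though, so I will update this tomorrow after some sleep.
  • This solution keeps all the patterns in memory for the whole run of the program, which may or may not matter depending on how many patterns there are.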
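On the first point, one way to reduce the risk is to compile each pattern as it is read and report the lines that are not valid regexes. The sketch below assumes one pattern per line; load_patterns is a hypothetical helper, not part of the solution that follows. Text read from a file is not subject to Python string-literal escaping, so a line containing \bbye\b reaches re with its backslashes intact.

```python
import re


def load_patterns(lines, flags=0):
    """Compile one pattern per line; return (compiled_patterns, errors)."""
    compiled, errors = [], []
    for line_num, raw in enumerate(lines, start=1):
        patt = raw.rstrip("\n")
        if not patt:
            continue  # skip blank lines: an empty pattern matches everywhere
        try:
            compiled.append(re.compile(patt, flags))
        except re.error as exc:
            # Remember where the bad pattern was so it can be reported later.
            errors.append((line_num, patt, str(exc)))
    return compiled, errors
```

An open file object can be passed as lines, for example load_patterns(patt_file, re.IGNORECASE).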

find_these.txt

hello
hi\n
\bbye\b

text.txt

Hello there. Goodbye.
A line. Bye line! Another line
which ends here. Hi

Code:

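import csv
import re

# Load the patterns, stripping only the trailing newline so other whitespace is kept
with open("../resources/find_these.txt") as patt_file:
    re_patts = [line.rstrip("\n") for line in patt_file]

out_headers = ["line_num", "line_text", "patt", "match_start", "match_end"]

# newline="" lets the csv module control line endings in the output file
with open("../resources/text.txt") as corp_file, open("../out/output.txt", "w", newline="") as out_file:
    out_writer = csv.writer(out_file)
    out_writer.writerow(out_headers)

    # Stream the corpus once; each line keeps its trailing "\n", so a pattern like hi\n can match
    for line_num, corp_line in enumerate(corp_file, start=1):
        for curr_patt in re_patts:
            # finditer reports every non-overlapping match, not just the first one
            for curr_match in re.finditer(curr_patt, corp_line, re.IGNORECASE):
                out_writer.writerow((line_num, corp_line, curr_patt, curr_match.start(), curr_match.end()))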

output.txt

line_num,line_text,patt,match_start,match_end
1,"Hello there. Goodbye.
",hello,0,5
2,"A line. Bye line! Another line
",\bbye\b,8,11
3,"which ends here. Hi
",hi\n,17,20
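Because each line_text field keeps its trailing newline, every record in this output spans two physical lines, so it is easier to read back with the csv module than by splitting on line breaks. A minimal sketch, using an in-memory copy of the first two records instead of the real file (read_matches is a hypothetical helper):

```python
import csv
import io


def read_matches(csv_file):
    """Return the rows of the match report as a list of dicts keyed by header."""
    return list(csv.DictReader(csv_file))


# In-memory stand-in for output.txt; a real file should be opened with newline="".
sample = (
    "line_num,line_text,patt,match_start,match_end\n"
    '1,"Hello there. Goodbye.\n",hello,0,5\n'
    '2,"A line. Bye line! Another line\n",\\bbye\\b,8,11\n'
)
rows = read_matches(io.StringIO(sample, newline=""))
for row in rows:
    # repr makes the embedded newline in line_text visible
    print(row["line_num"], row["patt"], repr(row["line_text"]))
```

All values come back as strings, so match_start and match_end need int() before they can be used as slice offsets.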

Let me know if anything is unclear :)