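Python Regex - Extract text between (multiple) expressions in a text file

Date: 2018-11-06 09:55:15

Tags: python regex text-mining text-extraction

I am a Python beginner and would be very thankful if you could help me with my text extraction problem.

I want to extract all the text that lies between two expressions in a text file (the beginning and the end of a letter). For both the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "To our", etc.). I want to analyze a bunch of files; below is an example of such a text file -> I want to extract all the text from "Dear" to "Douglas". If "letter_end" has no match, i.e. no letter_end expression is found, the output should start at letter_begin and end at the very end of the text file being analyzed.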

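EDIT: The end of the "recorded text" has to be after the match of "letter_end" and before the first line with 20 or more characters (as is the case for "Random text here as well" -> len = 24).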

"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""

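This is my code so far, but it is not flexible enough to capture the text between the expressions (there can be anything, such as lines, text, numbers or symbols, before "letter_begin" and after "letter_end"):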

import re

letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"


with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()
    # record all text between the regex beginning and end expressions
    output = re.findall(regex, text, re.MULTILINE | re.DOTALL | re.IGNORECASE)
    print(output)

I would really appreciate your help!

1 Answer:

Answer 0: (score: 1)

You can use

regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)

This pattern will result in a regex like

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}

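See the regex demo. Note that you must not use re.DOTALL with this pattern, because it would let .* match across line breaks. The re.MULTILINE option is redundant as well, since the pattern contains no ^ or $ anchors.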
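To illustrate the note on flags, here is a minimal sketch that runs a shortened version of the pattern with and without re.DOTALL. The sample string and the variable names bounded and greedy are made up for this illustration and are not part of the original answer.

```python
import re

# Hypothetical miniature letter; the two footer lines are not part of it.
sample = "Dear All\nbody line\nBest regards\nDouglas\n\nfooter one\nfooter two"
pattern = r"(?:dear)[\s\S]*?(?:best regards).*(?:\n.*){0,2}"

# Without DOTALL, each ".*" stops at the end of its line,
# so at most two extra lines follow the closing expression.
bounded = re.findall(pattern, sample, re.IGNORECASE)

# With DOTALL, the first ".*" also matches newlines and
# therefore runs all the way to the end of the text.
greedy = re.findall(pattern, sample, re.IGNORECASE | re.DOTALL)

print(bounded)  # ['Dear All\nbody line\nBest regards\nDouglas\n']
print(greedy)   # the whole sample, footer lines included
```

With re.DOTALL the footer lines end up in the match, which is exactly what the two-line limit was meant to prevent.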

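Details

  • (?:dear|to our|estimated) - any of the three opening values
  • [\s\S]*? - any 0 or more characters, including line breaks, as few as possible
  • (?:sincerely|yours|best regards) - any of the three closing values
  • .* - any 0 or more characters other than a newline (the rest of the closing line)
  • (?:\n.*){0,2} - zero, one or two repetitions of a newline followed by any 0 or more characters other than a newline.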
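One thing the plain alternation does not guard against is a keyword matching inside a longer word: "yours", for example, also matches the start of "yourself". The sketch below is an optional hardening and not part of the original answer. It wraps each group in \b word boundaries and passes every phrase through re.escape; the helper name build_alternation and the sample text are assumptions made for this illustration.

```python
import re

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]


def build_alternation(phrases):
    """Hypothetical helper: escape each phrase and require word boundaries."""
    return r"\b(?:{})\b".format("|".join(re.escape(p) for p in phrases))


# Pattern as in the answer, without boundaries.
naive_regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(
    "|".join(letter_begin), "|".join(letter_end)
)
# Same structure, but keywords must be whole words.
strict_regex = r"{}[\s\S]*?{}.*(?:\n.*){{0,2}}".format(
    build_alternation(letter_begin), build_alternation(letter_end)
)

# Made-up sample in which "yourself" appears before the real sign-off.
text = "Dear investor\nPlease see for yourself.\nLine two\nLine three\nSincerely\nAnn\n"

naive = re.findall(naive_regex, text, re.IGNORECASE)
strict = re.findall(strict_regex, text, re.IGNORECASE)

print(naive)   # stops two lines after "yourself", before "Sincerely"
print(strict)  # runs through "Sincerely" and the signature
```

Whether word boundaries are appropriate depends on the real data, so treat this as a starting point to test against the actual files.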

Python demo code

import re
text="""Some random text here

Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))

Output (of the second print call):

['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
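The pattern above returns nothing when no closing expression is present, and it always allows up to two trailing lines regardless of their length. The question also asks for a fallback to the end of the file and for the extract to stop before the first line of 20 or more characters. The following is a minimal sketch of one way to cover both, assuming that reading of the requirements; the function name extract_letter and the two sample strings are made up for this illustration.

```python
import re

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]
openings = "|".join(letter_begin)
closings = "|".join(letter_end)

# First alternative: lazily reach a closing expression, take the rest of
# that line, then keep following lines only while they have at most 19
# characters. Second alternative (tried only if no closing expression
# exists): take everything up to the end of the text.
pattern = re.compile(
    r"(?:{o})(?:[\s\S]*?(?:{c}).*(?:\n.{{0,19}}(?=\n|\Z))*|[\s\S]*)".format(
        o=openings, c=closings
    ),
    re.IGNORECASE,
)


def extract_letter(text):
    """Hypothetical helper: return the first letter found, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


with_closing = (
    "Intro\nDear Shareholders We\nare pleased.\n"
    "Best regards \nDouglas\n\nRandom text here as well"
)
without_closing = "Intro\nDear Shareholders\nno sign-off in this file"

print(repr(extract_letter(with_closing)))
print(repr(extract_letter(without_closing)))
```

In the first sample the match ends after "Douglas" and the empty line, because "Random text here as well" has 24 characters. In the second sample there is no closing expression, so the match runs from "Dear" to the end of the text.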