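Python regex: stop at the first line with more than 20 characters

Asked: 2018-11-07 11:20:58

Tags: python regex text-extraction

I have a letter from which I need to extract a specific part. The beginning and end are marked by clear opening and closing expressions (letter_begin / letter_end). My problem is that, after letter_end has matched, the "recording" of the text should stop before the first line that has more than 20 characters. In my code, it stops after 2 lines instead. Here is my sample text and code so far: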

import re

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line    """

letter_begin = ["dear", "to our", "fellow investors"] # All expressions for "beginning" of Letter to the Shareholders (LttS)
openings = "|".join(letter_begin)
letter_end = ["sincerely", "best regards", "cordially,"] # All expressions for "ending" of Letter to the Shareholders (LttS)
closings = "|".join(letter_end)
# After the closing expression, this takes the rest of that line plus up to two more lines
regex = r"(?:" + openings + r")[\s\S]*?" + r"(?:" + closings + r").*(?:\n.*){0,2}"
output = re.findall(regex, sample_text, re.IGNORECASE) # record all text between Regex (beginning and end expressions)
print(output)

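2 Answers:

Answer 0 (score: 1)

I'm not quite sure what your expected output is, but this is fairly simple to do without regex (which leaves you with one problem fewer).

The solution below assumes that sample_text contains \n (newline characters); it will not work if sample_text is one long line (i.e. has no \n at all).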

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]

lines = sample_text.strip().split("\n")

target_start_idx = None
target_end_idx = None

for index, line in enumerate(lines):
    line = line.lower()

    if any(line.startswith(beg) for beg in letter_begin):
        target_start_idx = index
        continue

    if any(line.startswith(end) for end in letter_end):
        target_end_idx = index
        break

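if target_end_idx is not None:
    # Extend the end index up to the line before the first long line.
    for index, line in enumerate(lines[target_end_idx + 1 :]):
        if len(line) > 20:  # "more than 20 characters", as the question asks
            target_end_idx += index
            break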

if target_start_idx is not None and target_end_idx is not None:
    target = "\n".join(lines[target_start_idx : target_end_idx + 1])
    print(target)

The output is

Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director

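One edge case in the loop above: if no long line follows the closing line (for example, the letter is the last thing in the text), the inner loop never breaks and the signature lines are left out. The following is a minimal sketch that handles this case. The helper name extract_letter and its max_len parameter are hypothetical, not part of the original answer, and the sketch additionally assumes that a closing line only counts once an opening line has been seen.

```python
def extract_letter(text, begin_markers, end_markers, max_len=20):
    """Return the first letter in text, or None if no letter is found.

    The letter runs from the first opening line through the short lines
    (signature, title) that follow the closing line.
    """
    lines = text.strip().split("\n")
    start = end = None
    for index, line in enumerate(lines):
        lowered = line.lower()
        if start is None and any(lowered.startswith(m) for m in begin_markers):
            start = index
        elif start is not None and any(lowered.startswith(m) for m in end_markers):
            end = index
            break
    if start is None or end is None:
        return None
    # Keep short lines after the closing line; stop at the first line longer
    # than max_len characters, or at the end of the text.
    stop = end + 1
    while stop < len(lines) and len(lines[stop]) <= max_len:
        stop += 1
    return "\n".join(lines[start:stop])


letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]
text_without_trailing_line = (
    "Some random text right here\n"
    "Dear Shareholders: We are pleased to report.\n"
    "Best regards\n"
    "Douglas - Director"
)
print(extract_letter(text_without_trailing_line, letter_begin, letter_end))
```

Because the slice end is computed with a while loop that also stops at the end of the list, the signature line is kept even when nothing longer follows it.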

Edit

Based on your last comment, I can think of two approaches. Hopefully one of them solves your problem.

Option 1

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]

lines = sample_text.strip().split("\n")

target_start_indexes = []
target_end_indexes = []

for index, line in enumerate(lines):
    line = line.lower()

    if any(beg in line for beg in letter_begin):
        target_start_indexes.append(index)
        continue

    if any(end in line for end in letter_end):
        target_end_indexes.append(index)
        continue

for target_index, target_end_idx in enumerate(target_end_indexes):
    # Extend each end index up to the line before the first long line.
    for line_index, line in enumerate(lines[target_end_idx + 1 :]):
        if len(line) > 20:  # "more than 20 characters", as the question asks
            target_end_indexes[target_index] = target_end_idx + line_index
            break


target = []
if target_start_indexes and target_end_indexes:
    for target_start_idx, target_end_idx in zip(
        target_start_indexes, target_end_indexes
    ):
        target.append("\n".join(lines[target_start_idx : target_end_idx + 1]))

    print("\n".join(target))

Output

Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director

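Option 1 pairs openings and closings by position with zip, which works when every opening is followed by exactly one closing. If a closing phrase shows up before any opening, the pairs shift and a slice can come out empty. A possible alternative is to scan the lines once and pair each opening with the next closing after it. This is only a sketch: the function name extract_letters and the max_len parameter are hypothetical, and it keeps the substring matching used in Option 1.

```python
def extract_letters(text, begin_markers, end_markers, max_len=20):
    """Return every letter in text, pairing each opening line with the
    next closing line that follows it."""
    lines = text.strip().split("\n")
    letters = []
    start = None  # index of the current opening line, if any
    index = 0
    while index < len(lines):
        lowered = lines[index].lower()
        if start is None:
            # Closing phrases are ignored until an opening has been seen.
            if any(marker in lowered for marker in begin_markers):
                start = index
        elif any(marker in lowered for marker in end_markers):
            # Absorb the short lines (signature) after the closing line.
            stop = index + 1
            while stop < len(lines) and len(lines[stop]) <= max_len:
                stop += 1
            letters.append("\n".join(lines[start:stop]))
            start = None
            index = stop
            continue
        index += 1
    return letters


letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]
tricky_text = """Cordially, an earlier note that is not part of any letter.
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""
for letter in extract_letters(tricky_text, letter_begin, letter_end):
    print(letter)
```

The stray "Cordially," line is skipped because no opening has been seen yet, so the single letter is still extracted in full.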

Option 2

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]

lines = sample_text.strip().split("\n")

target_start_idx = None
target_end_idx = None

for index, line in enumerate(lines):
    line = line.lower()

    if any(beg in line for beg in letter_begin):
        if target_start_idx is None:
            target_start_idx = index
            continue

    if any(end in line for end in letter_end):
        target_end_idx = index

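if target_end_idx is not None:
    # Extend the end index up to the line before the first long line.
    for index, line in enumerate(lines[target_end_idx + 1 :]):
        if len(line) > 20:  # "more than 20 characters", as the question asks
            target_end_idx += index
            break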

if target_start_idx is not None and target_end_idx is not None:
    target = "\n".join(lines[target_start_idx : target_end_idx + 1])
    print(target)

Output

Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director

Answer 1 (score: 0)

If you insist on using a single regex, add a positive lookahead at the end that requires a line of more than 20 characters:

(?=[^\n]{21,})

You may also need to add the re.DOTALL flag:

re.IGNORECASE | re.DOTALL
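
For illustration, here is one way the lookahead could be worked into the pattern from the question. It is a sketch rather than the answerer's exact regex. The `\n` inside the lookahead, the `\Z` fallback for text that ends without a long line, and the use of re.escape are assumptions added here. This variant uses `[\s\S]` instead of `.` to cross line breaks, so it does not need re.DOTALL. If the flag were set on the question's original pattern, its greedy `.*` would run on to the end of the text.

```python
import re

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]
openings = "|".join(re.escape(word) for word in letter_begin)
closings = "|".join(re.escape(word) for word in letter_end)

# After the closing phrase, extend lazily until the next line has more than
# 20 characters, or until the end of the text if no such line exists.
regex = (
    r"(?:" + openings + r")[\s\S]*?(?:" + closings + r")"
    r"[\s\S]*?(?=\n[^\n]{21,}|\Z)"
)
output = re.findall(regex, sample_text, re.IGNORECASE)
print(output)
```

With the sample text, the match runs from "Dear Shareholders" through "Douglas - Director" and stops before the long line that follows.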