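I have a letter from which I need to extract a specific part. The beginning and end are marked by clear opening/closing expressions (letter_begin / letter_end). My problem is that, after letter_end has matched, the "recording" of text should stop just before the first line that has more than 20 characters. In my code, it always stops after 2 lines. Here is my sample text and my code so far: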
import re

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line """
letter_begin = ["dear", "to our", "fellow investors"] # All expressions for "beginning" of Letter to the Shareholders (LttS)
openings = "|".join(letter_begin)
letter_end = ["sincerely", "best regards", "cordially,"] # All expressions for "ending" of Letter to the Shareholders (LttS)
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")[\s\S]*?" + r"(?:" + closings + r").*(?:\n.*){0,2}"
output = re.findall(regex, sample_text, re.IGNORECASE) # record all text between Regex (beginning and end expressions)
print(output)
Answer 0 (score: 1)
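I'm not entirely sure what your expected output is, but this is quite simple to do without a regex (so that is one problem fewer to solve).
The solution below assumes that sample_text contains \n (newline characters); it will not work if sample_text is one long line (i.e. without any \n).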
sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""
letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]
lines = sample_text.strip().split("\n")
target_start_idx = None
target_end_idx = None
# Find the opening line and the first closing line after it
for index, line in enumerate(lines):
    line = line.lower()
    if any(line.startswith(beg) for beg in letter_begin):
        target_start_idx = index
        continue
    if any(line.startswith(end) for end in letter_end):
        target_end_idx = index
        break
# Extend the end index up to (but not including) the first line with 20+ characters
if target_end_idx is not None:
    for index, line in enumerate(lines[target_end_idx + 1 :]):
        if len(line) >= 20:
            target_end_idx += index
            break
if target_start_idx is not None and target_end_idx is not None:
    target = "\n".join(lines[target_start_idx : target_end_idx + 1])
    print(target)
The output is:
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
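The line-based steps above can also be wrapped in a reusable function. This is only a sketch: the name extract_letter and the parameter long_line_len are made up for illustration and do not come from the question. It uses str.splitlines() so that \r\n line endings are handled too, keeps the first opening line it sees, and, unlike the loop above, extends to the end of the text when no long line follows the closing.

```python
def extract_letter(text, begin_markers, end_markers, long_line_len=20):
    """Return the first letter found in `text`, or None if none is complete.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    lines = text.strip().splitlines()
    start = end = None
    for i, line in enumerate(lines):
        lowered = line.lower()
        if start is None:
            # Still looking for a line that starts with an opening expression
            if any(lowered.startswith(b) for b in begin_markers):
                start = i
        elif any(lowered.startswith(e) for e in end_markers):
            end = i
            break
    if start is None or end is None:
        return None
    # Keep short trailing lines (e.g. a signature) until a line with
    # long_line_len or more characters appears, or the text ends.
    while end + 1 < len(lines) and len(lines[end + 1]) < long_line_len:
        end += 1
    return "\n".join(lines[start : end + 1])


sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

result = extract_letter(
    sample_text,
    ["dear", "to our", "fellow investors"],
    ["sincerely", "best regards", "cordially,"],
)
print(result)
```

Returning None when either marker is missing lets the caller tell "no letter found" apart from an empty letter.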
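Edit
Based on your last comment, I can think of two approaches. Hopefully one of them solves your problem.
Option 1: extract each letter separately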
sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""
letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]
lines = sample_text.strip().split("\n")
target_start_indexes = []
target_end_indexes = []
# Collect every opening line and every closing line
for index, line in enumerate(lines):
    line = line.lower()
    if any(beg in line for beg in letter_begin):
        target_start_indexes.append(index)
        continue
    if any(end in line for end in letter_end):
        target_end_indexes.append(index)
        continue
# Extend each end index up to (but not including) the next line with 20+ characters
for target_index, target_end_idx in enumerate(target_end_indexes):
    for line_index, line in enumerate(lines[target_end_idx + 1 :]):
        if len(line) >= 20:
            target_end_idx += line_index
            target_end_indexes[target_index] = target_end_idx
            break
target = []
if target_start_indexes and target_end_indexes:
    # Pair the n-th opening with the n-th closing
    for target_start_idx, target_end_idx in zip(
        target_start_indexes, target_end_indexes
    ):
        target.append("\n".join(lines[target_start_idx : target_end_idx + 1]))
print("\n".join(target))
Output:
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Option 2: extract everything from the first opening to the last closing
sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""
letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]
lines = sample_text.strip().split("\n")
target_start_idx = None
target_end_idx = None
for index, line in enumerate(lines):
    line = line.lower()
    if any(beg in line for beg in letter_begin):
        # Remember only the first opening line
        if target_start_idx is None:
            target_start_idx = index
        continue
    if any(end in line for end in letter_end):
        # No break here, so the last closing line wins
        target_end_idx = index
# Extend the end index up to (but not including) the next line with 20+ characters
if target_end_idx is not None:
    for index, line in enumerate(lines[target_end_idx + 1 :]):
        if len(line) >= 20:
            target_end_idx += index
            break
if target_start_idx is not None and target_end_idx is not None:
    target = "\n".join(lines[target_start_idx : target_end_idx + 1])
    print(target)
Output:
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
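Answer 1 (score: 0)
If you insist on a single regular expression, add a positive lookahead at the end for a line with more than 20 characters:
(?=[^\n]{21,})
You may also need to add the re.DOTALL flag:
re.IGNORECASE | re.DOTALL
Note that with re.DOTALL the dot also matches newlines, so any .* in the pattern no longer stops at the end of a line.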
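Here is a minimal sketch of how the lookahead idea might be combined with the pattern from the question. It is one possible adaptation rather than the exact pattern this answer had in mind: the lookahead is written as \n[^\n]{21,} so that it inspects the next line, \Z is added as a fallback for when no long line follows, the fixed {0,2} is replaced by a lazy repetition, and the markers are passed through re.escape. re.DOTALL is left out because the pattern uses [\s\S] and [^\n] explicitly.

```python
import re

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]

# Escape the markers so punctuation in them is matched literally
openings = "|".join(re.escape(s) for s in letter_begin)
closings = "|".join(re.escape(s) for s in letter_end)

pattern = re.compile(
    "(?:" + openings + ")"      # an opening expression
    + r"[\s\S]*?"               # lazily, everything up to ...
    + "(?:" + closings + ")"    # ... the first closing expression
    + r"[^\n]*"                 # rest of the closing line
    + r"(?:\n[^\n]*)*?"         # as few following lines as possible
    + r"(?=\n[^\n]{21,}|\Z)",   # stop before a line with more than 20 chars, or at the end
    re.IGNORECASE,
)

matches = pattern.findall(sample_text)
print(matches[0])
```

Because the pattern has no capturing groups, findall returns the full text of each match, so several letters in one document come back as separate list items.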