Question

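I have many news articles, some of which have an introductory and/or closing statement. The possible combinations are...

Some text about a news story.
The BBC reports: Some text about a news story. Read more on BBC.com.
The BBC reports: Some text about a news story.
Some text about a news story. Read more on BBC.com.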

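What I'd like to do is return "Some text about a news story." in every case. The regex below only handles the second example, where both the introduction and the closing statement are present. I'm struggling with the cases where either one is missing.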

re.search(r'(?i)(?<=: ).*(?=Read more|Full story|\. Source)', str(doc)).group()

# "(?i)" to ignore case.
# "(?<=: )" to capture text after and excluding ": "
# ".*" match everything between the two patterns. 
# "(?=Read more|Full story|\. Source)" match everything before these three strings.
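
To see which of the four variants this pattern actually covers, here is a minimal sketch that runs it over each of them. The helper name `extract_original` and the `SAMPLES` list are illustrative and not part of the original post; the helper returns None instead of calling `.group()` on a failed match, which would otherwise raise AttributeError.

```python
import re

# Pattern from the question: the ": " lookbehind and the closing
# lookahead are both mandatory, so neither part is optional.
ORIGINAL = r'(?i)(?<=: ).*(?=Read more|Full story|\. Source)'

# The four variants listed in the question (illustrative sample data).
SAMPLES = [
    "Some text about a news story.",
    "The BBC reports: Some text about a news story. Read more on BBC.com.",
    "The BBC reports: Some text about a news story.",
    "Some text about a news story. Read more on BBC.com.",
]

def extract_original(doc):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(ORIGINAL, doc)
    return m.group() if m else None

results = [extract_original(s) for s in SAMPLES]
for sample, result in zip(SAMPLES, results):
    print(repr(sample), "->", repr(result))
```

Only the variant with both an introduction and a closing statement produces a match; the other three return None because one of the two lookarounds cannot succeed.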

Answer 1

It seems you can use

import re
doc = "The BBC reports: Some text about a news story. Read more on BBC.com."
rx = r'(?i)(?:[^:\n]*:\s*|^)(.*?)(?:$|Read more|Full story|\. Source)'
m = re.search(rx, doc)
if m:
    print(m.group(1))

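See the regex demo.

Details

(?i) - case-insensitive flag
(?:[^:\n]*:\s*|^) - a non-capturing group that matches either 0+ characters other than : and a newline, followed by : and then 0+ whitespace characters, or the start of the string
(.*?) - Group 1: any 0+ characters other than line break characters, as few as possible
(?:$|Read more|Full story|\. Source) - a non-capturing group that matches the end of the string, Read more, Full story, or . Source.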
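
As a quick check of the breakdown above, the following sketch applies the same pattern to all four variants from the question. The `extract` helper, the `SAMPLES` list, and the `.strip()` call are additions for illustration, not part of the answer's snippet. The `.strip()` is there because Group 1 keeps the space that precedes "Read more".

```python
import re

# Same pattern as in the answer, compiled once for reuse.
RX = re.compile(r'(?i)(?:[^:\n]*:\s*|^)(.*?)(?:$|Read more|Full story|\. Source)')

# The four variants from the question (illustrative sample data).
SAMPLES = [
    "Some text about a news story.",
    "The BBC reports: Some text about a news story. Read more on BBC.com.",
    "The BBC reports: Some text about a news story.",
    "Some text about a news story. Read more on BBC.com.",
]

def extract(doc):
    """Return Group 1 without surrounding whitespace, or None if nothing matches."""
    m = RX.search(doc)
    return m.group(1).strip() if m else None

for sample in SAMPLES:
    print(repr(extract(sample)))
```

One caveat under these assumptions: the optional prefix consumes everything up to the first colon on the line, so a colon inside the story text itself would be treated as the end of an introduction.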
