I am trying to extract sentences from a large text corpus that contain words from a list.
For example, searching for "noodl", "vege" and "meat".
str1 = "My new noodles are great\n vegetables. Not \nthis noodle sentence though.\n Nor this vege sentences."
results = re.findall(regex, str1)
should return "My new noodles are great\n vegetables." as the only match.
Starting from (Python extracting sentence containing 2 words), I came up with the following regex:
import re

regex = re.compile(
    r"""
    ([^.]*?                  # Starting with anything but '.'
        (                    # Capture group start
            (noodl|vege|meat)  # Contains one of these words
            [^.]*            # with anything but '.' in between
        ){2,}                # At least 2 times
        [^.]*\.              # Followed by anything but '.', then a '.'
    )
    """,
    re.MULTILINE | re.IGNORECASE | re.VERBOSE)
But this results in:
for x in results:
    print(x)
#My new noodles are great\n vegetables.
#vegetables
#vege
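This is unexpected. How should I change my regex so that it matches only the whole sentence? The sentences found will be processed further. The natural language being processed is not English, but the current results are the same as with the demo sentences.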
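
For reference, a minimal sketch of one possible direction, assuming the extra entries come from the capture groups: when a pattern contains groups, re.findall returns the groups rather than the whole match. The variant below keeps the same structure but makes the inner groups non-capturing and drops the outer group; the name `sentence` is only illustrative, and this has been checked against the demo string only, not against the real corpus.

```python
import re

# Same structure as the pattern in the question, but with non-capturing
# groups (?:...) so that re.findall returns each whole match as a string.
# re.MULTILINE is omitted here because the pattern has no ^ or $ anchors.
regex = re.compile(
    r"""
    [^.]*?                   # lazily skip anything but '.'
    (?:                      # non-capturing group
        (?:noodl|vege|meat)  # one of the search words
        [^.]*                # anything but '.' in between
    ){2,}                    # at least two occurrences in the sentence
    [^.]*\.                  # rest of the sentence, up to and including '.'
    """,
    re.IGNORECASE | re.VERBOSE,
)

str1 = ("My new noodles are great\n vegetables. Not \nthis noodle sentence "
        "though.\n Nor this vege sentences.")

results = regex.findall(str1)
for sentence in results:
    print(repr(sentence))  # repr makes the embedded newline visible
```

Because "." is the only sentence delimiter here, abbreviations or decimal numbers in the real text would split sentences early; that limitation is inherited from the original pattern.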