Question

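I am trying to extract sentences that contain words from a given list out of a large collection of text.

For example, searching for "noodl", "vege" and "meat".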

str1 = "My new noodles are great\n vegetables. Not \nthis noodle sentence though.\n Nor this vege sentences."
results = re.findall(regex, str1)

should return "My new noodles are great\n vegetables." as the only match.

Starting from "Python extracting sentence containing 2 words", I came up with the following regex:

import re

regex = re.compile(
    r"""
    ([^.]*?                      # Starting with anything but '.'
        (                        # Capture group start
            (noodl|vege|meat)    # Contains one of these words
            [^.]*                # with anything but '.' in between
        ){2,}                    # At least 2 times
        [^.]*\.                  # Followed by anything but '.', then a '.'
    )
    """,
    re.MULTILINE | re.IGNORECASE | re.VERBOSE)

But this results in

for x in results:
    print(x)
#My new noodles are great\n vegetables.
#vegetables
#vege

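This is unexpected. How should I change my regex so that it matches only whole sentences? The sentences found will be processed further. The natural language being processed is not English, but the current results are the same as with the demo sentences.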
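
One possible direction, offered as a sketch rather than a confirmed fix: when a pattern contains capture groups, re.findall returns the contents of the groups instead of the whole match, which would explain the extra "vegetables" and "vege" entries. The version below assumes the same three stems and the same demo string, and replaces every group with a non-capturing (?:...) group. The name sentence_re is made up for illustration; the redundant trailing [^.]* and the re.MULTILINE flag are dropped because the pattern uses no ^ or $ anchors.

```python
import re

# Same structure as the pattern in the question, but with non-capturing
# groups (?:...) so that findall yields whole matches, not group tuples.
sentence_re = re.compile(
    r"""
    [^.]*?                    # lazily skip anything but '.'
    (?:                       # non-capturing repeat group
        (?:noodl|vege|meat)   # one of the keyword stems
        [^.]*                 # anything but '.' after it
    ){2,}                     # at least two keyword hits
    \.                        # the sentence-ending '.'
    """,
    re.IGNORECASE | re.VERBOSE,
)

str1 = ("My new noodles are great\n vegetables. Not \nthis noodle "
        "sentence though.\n Nor this vege sentences.")

# strip() removes whitespace left over from the previous sentence boundary.
results = [match.strip() for match in sentence_re.findall(str1)]
print(results)  # ['My new noodles are great\n vegetables.']
```

Note that this still treats '.' as the only sentence terminator and counts the same stem twice (for example "noodle" followed by "noodles") as two hits, exactly as the original pattern does. An alternative that keeps the capture groups would be re.finditer with match.group(0).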

Python: extracting sentences containing a list of special words

0 answers: