Question

I am trying to extract all sentences from a text that contain a specified word.

txt="I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt
re.findall(r"\."+".+"+"apple"+".+"+"\.", txt)

But it returns:

[".I like to eat apple. Me too. Let's go buy some apples."]

instead of:

[".I like to eat apple.", "Let's go buy some apples."]

Any help, please?

Answer 1

No need for a regex:

>>> txt = "I like to eat apple. Me too. Let's go buy some apples."
>>> [sentence + '.' for sentence in txt.split('.') if 'apple' in sentence]
['I like to eat apple.', " Let's go buy some apples."]
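
The output above keeps the leading space of the second sentence. A minimal sketch of the same split-based idea wrapped in a function that also trims that whitespace follows; the name sentences_containing is hypothetical and not part of the original answer, and the check is still a plain substring test.

```python
def sentences_containing(text, needle):
    """Return the '.'-terminated sentences of text that contain needle as a substring."""
    # split('.') drops the periods, so one is added back; strip() removes the
    # leading space left over from the ". " separator.
    return [part.strip() + '.' for part in text.split('.') if needle in part]


txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_containing(txt, "apple"))
# ['I like to eat apple.', "Let's go buy some apples."]
```

Because this is a substring test, "apples" still counts as a hit for "apple".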

Answer 2

In [3]: re.findall(r"([^.]*?apple[^.]*\.)",txt)
Out[3]: ['I like to eat apple.', " Let's go buy some apples."]

Answer 3

In [7]: import re

In [8]: txt=".I like to eat apple. Me too. Let's go buy some apples."

In [9]: re.findall(r'([^.]*apple[^.]*)', txt)
Out[9]: ['I like to eat apple', " Let's go buy some apples"]

Note, however, that @jamylak's split-based solution is faster:

In [10]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
1000000 loops, best of 3: 1.96 us per loop

In [11]: %timeit [s+ '.' for s in txt.split('.') if 'apple' in s]
1000000 loops, best of 3: 819 ns per loop

For larger strings, the speed difference is smaller but still significant:

In [24]: txt = txt*10000

In [25]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
100 loops, best of 3: 8.49 ms per loop

In [26]: %timeit [s+'.' for s in txt.split('.') if 'apple' in s]
100 loops, best of 3: 6.35 ms per loop
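
The %timeit lines above need IPython. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module; the function names with_regex and with_split are made up for this example, and the timings will differ from machine to machine.

```python
import re
import timeit

# Same sample text as in the question, repeated to get a larger input.
txt = "I like to eat apple. Me too. Let's go buy some apples." * 10000


def with_regex():
    return re.findall(r'[^.]*apple[^.]*', txt)


def with_split():
    return [s + '.' for s in txt.split('.') if 'apple' in s]


for fn in (with_regex, with_split):
    # Best of 3 runs, 10 calls per run, reported per call.
    seconds = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```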

Answer 4

You can use str.split:

>>> txt="I like to eat apple. Me too. Let's go buy some apples."
>>> txt.split('. ')
['I like to eat apple', 'Me too', "Let's go buy some apples."]

>>> [ t for t in txt.split('. ') if 'apple' in t]
['I like to eat apple', "Let's go buy some apples."]

Answer 5

r"\."+".+"+"apple"+".+"+"\."

That line is a bit odd; why concatenate so many separate strings? You could just use r'\..+apple.+\.'.

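Anyway, the problem with your regular expression is its greediness. By default, x+ matches x as many times as it possibly can. So .+ matches as many characters (any characters) as possible, including dots and further occurrences of apple.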

What you want instead is a non-greedy expression; you can usually get one by appending a ? to the quantifier: .+?.

That gives you the following result:

['.I like to eat apple. Me too.']

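As you can see, you no longer get both apple sentences in a single match, but Me too. is still included. That is because the .+? after apple must match at least one character, so it consumes the . that ends the first sentence, and the match then runs on to the end of the following sentence.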

A working regular expression would be: r'\.[^.]*?apple[^.]*?\.'

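Here you do not match just any character, but only characters that are not dots themselves. We also allow matching no characters at all (because after apple in the first sentence there are no non-dot characters). Using that expression results in: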

['.I like to eat apple.', ". Let's go buy some apples."]
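
To see the three behaviours side by side, here is a small sketch that runs the greedy, non-greedy, and negated-class patterns from this answer on the question's text (with the leading dot prepended, as in the question). The variable names are only for illustration.

```python
import re

txt = ".I like to eat apple. Me too. Let's go buy some apples."

# Greedy: .+ runs to the end and backtracks only as far as the last "apple".
greedy = re.findall(r'\..+apple.+\.', txt)

# Non-greedy: .+? after "apple" still has to consume one character, the dot.
lazy = re.findall(r'\..+?apple.+?\.', txt)

# Negated class: [^.] can never cross a sentence boundary.
negated = re.findall(r'\.[^.]*?apple[^.]*?\.', txt)

print(greedy)   # [".I like to eat apple. Me too. Let's go buy some apples."]
print(lazy)     # ['.I like to eat apple. Me too.']
print(negated)  # ['.I like to eat apple.', ". Let's go buy some apples."]
```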

Answer 6

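Clearly, the example in the question is really "extract sentence containing substring" rather than "extract sentence containing word". Here is how to solve the "extract sentence containing word" problem in Python.

A word can appear at the beginning, in the middle, or at the end of a sentence. Not limiting myself to the example in the question, here is a general function for searching for a word in a sentence:

import re

def searchWordinSentence(word, sentence):
    # Match the word only when it is delimited by spaces or by the start/end of the sentence.
    pattern = re.compile('(^| )' + re.escape(word) + '( |$)')
    return bool(pattern.search(sentence))

Applied to the example in the question:

txt="I like to eat apple. Me too. Let's go buy some apples."
word = "apple"
print([t for t in txt.split('. ') if searchWordinSentence(word, t)])

The corresponding output is:

['I like to eat apple']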
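
As an alternative sketch of the whole-word check, the regex word boundary \b also handles a word followed by punctuation such as a comma. The helper name contains_word is hypothetical and is not the function from this answer.

```python
import re


def contains_word(word, sentence):
    """Return True if word occurs in sentence as a whole word."""
    # \b matches between a word character and a non-word character (or the
    # string edge), so "apple" is not found inside "apples".
    return re.search(r'\b' + re.escape(word) + r'\b', sentence) is not None


txt = "I like to eat apple. Me too. Let's go buy some apples."
print([s for s in txt.split('. ') if contains_word("apple", s)])
# ['I like to eat apple']
```

Note that \b treats digits and underscores as word characters, which may or may not be what you want for your text.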

Python: extract sentences containing a word

6 answers: