Python: extract sentences containing a word

Time: 2013-04-16 09:03:12

Tags: python regex text-segmentation

I am trying to extract all sentences that contain a specified word from a text.

txt="I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt
re.findall(r"\."+".+"+"apple"+".+"+"\.", txt)

But it returns:

[".I like to eat apple. Me too. Let's go buy some apples."]

instead of:

[".I like to eat apple.", "Let's go buy some apples."]

Any help?

6 answers:

Answer 0 (score: 17)

No regex needed:

>>> txt = "I like to eat apple. Me too. Let's go buy some apples."
>>> [sentence + '.' for sentence in txt.split('.') if 'apple' in sentence]
['I like to eat apple.', " Let's go buy some apples."]
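
If the leading space in the second result is unwanted, the same split-based idea can be wrapped in a small helper that strips each piece. This is only a sketch: the name `sentences_containing` is made up for illustration, and it assumes every sentence ends with a plain '.'.

```python
def sentences_containing(text, word):
    """Return the '.'-terminated sentences of text that contain word as a substring."""
    result = []
    for piece in text.split('.'):
        piece = piece.strip()  # drop the space left behind after each '.'
        if word in piece:
            result.append(piece + '.')
    return result


txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_containing(txt, 'apple'))
# ['I like to eat apple.', "Let's go buy some apples."]
```

Like the one-liner above, this is a substring test, so 'apple' also matches 'apples'.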

Answer 1 (score: 12)

In [3]: re.findall(r"([^.]*?apple[^.]*\.)",txt)
Out[4]: ['I like to eat apple.', " Let's go buy some apples."]
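
To search for an arbitrary word with this pattern, the word can be spliced in with `re.escape` so that characters such as '+' are not read as regex syntax. A minimal sketch, where the helper name `find_sentences` is an assumption and not part of the answer:

```python
import re


def find_sentences(text, word):
    """Return every run of non-dot characters that contains word, up to and including the next '.'."""
    # re.escape makes characters like '+' in word match literally
    pattern = r"[^.]*?" + re.escape(word) + r"[^.]*\."
    return re.findall(pattern, text)


txt = "I like to eat apple. Me too. Let's go buy some apples."
print(find_sentences(txt, "apple"))
# ['I like to eat apple.', " Let's go buy some apples."]
```

Because the pattern never crosses a '.', a search word that itself contains a dot cannot match.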

Answer 2 (score: 8)

In [7]: import re

In [8]: txt=".I like to eat apple. Me too. Let's go buy some apples."

In [9]: re.findall(r'([^.]*apple[^.]*)', txt)
Out[9]: ['I like to eat apple', " Let's go buy some apples"]

But note that @jamylak's split-based solution is faster:

In [10]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
1000000 loops, best of 3: 1.96 us per loop

In [11]: %timeit [s+ '.' for s in txt.split('.') if 'apple' in s]
1000000 loops, best of 3: 819 ns per loop

For larger strings, the speed difference is smaller but still significant:

In [24]: txt = txt*10000

In [25]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
100 loops, best of 3: 8.49 ms per loop

In [26]: %timeit [s+'.' for s in txt.split('.') if 'apple' in s]
100 loops, best of 3: 6.35 ms per loop
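
Outside IPython, a comparison along these lines can be run with the standard `timeit` module. This is a rough sketch: the function names `with_regex` and `with_split` and the choice of 10 calls per measurement are assumptions, and the absolute numbers will depend on the machine and the Python version.

```python
import re
import timeit

txt = ".I like to eat apple. Me too. Let's go buy some apples."


def with_regex(text):
    return re.findall(r'([^.]*apple[^.]*)', text)


def with_split(text):
    return [s + '.' for s in text.split('.') if 'apple' in s]


for label, text in (("short", txt), ("long", txt * 10000)):
    for func in (with_regex, with_split):
        # number=10 keeps the long case quick; raise it for steadier numbers
        seconds = timeit.timeit(lambda: func(text), number=10)
        print("%-5s %-10s %.6f s for 10 calls" % (label, func.__name__, seconds))
```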

Answer 3 (score: 3)

You can use str.split:

>>> txt="I like to eat apple. Me too. Let's go buy some apples."
>>> txt.split('. ')
['I like to eat apple', 'Me too', "Let's go buy some apples."]

>>> [ t for t in txt.split('. ') if 'apple' in t]
['I like to eat apple', "Let's go buy some apples."]

Answer 4 (score: 2)

r"\."+".+"+"apple"+".+"+"\."

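This line is a bit odd; why concatenate so many separate strings? You could just write r'\..+apple.+\.'.

Anyway, the problem with your regular expression is its greediness. By default, x+ matches x as many times as possible. So .+ matches as many characters (any characters) as possible, including dots and further occurrences of apple.

What you want instead is a non-greedy expression; you can usually get one by appending ? to the quantifier: .+?

That gives you the following result: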

['.I like to eat apple. Me too.']

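As you can see, you no longer get both apple sentences in one match, but you still get Me too. along with the first one. That is because .+? must match at least one character, so it consumes the . right after apple, and the match then has to run on to the end of the following sentence.

A working regular expression would be: r'\.[^.]*?apple[^.]*?\.'

Here you no longer match any character, only characters that are not themselves dots. We also allow matching no characters at all (because in the first sentence there are no non-dot characters after apple). Using that expression gives: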

['.I like to eat apple.', ". Let's go buy some apples."]
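
The three patterns discussed in this answer can be compared side by side on the dot-prefixed text from the question. A small sketch; the labels and the `results` dictionary are only there for illustration:

```python
import re

txt = ".I like to eat apple. Me too. Let's go buy some apples."

patterns = [
    ("greedy", r"\..+apple.+\."),
    ("non-greedy", r"\..+?apple.+?\."),
    ("no dots inside", r"\.[^.]*?apple[^.]*?\."),
]

# Run each pattern over the same text and keep the matches by label
results = {label: re.findall(pattern, txt) for label, pattern in patterns}
for label, _ in patterns:
    print(label, results[label])
```

The greedy pattern returns the whole text as one match, the non-greedy one stops after Me too., and only the last pattern returns one match per apple sentence.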

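Answer 5 (score: 0)

Obviously, the sample in the question is really "extract sentences containing a substring" rather than
"extract sentences containing a word". Here is how to solve the "extract sentences containing a word" problem in Python:

A word can be at the beginning, in the middle, or at the end of a sentence. Not limited to the example in the question, I will provide a general function for searching for a word in a sentence:

import re

def searchWordinSentence(word, sentence):
    # match the word surrounded by spaces, at the start followed by a space,
    # or at the end preceded by a space
    pattern = re.compile(' ' + word + ' |^' + word + ' | ' + word + '$')
    return bool(re.search(pattern, sentence))

For the example in the question, we can use it as follows:

txt="I like to eat apple. Me too. Let's go buy some apples."
word = "apple"
print([t for t in txt.split('. ') if searchWordinSentence(word, t)])

The corresponding output is:

['I like to eat apple']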
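
A pattern built from literal spaces misses a word that is followed directly by punctuation, or that makes up the whole sentence. One possible alternative, not part of this answer, is the regex word-boundary anchor `\b`. The sketch below uses a hypothetical helper `contains_word` and assumes the search word consists only of word characters (letters, digits, underscore).

```python
import re


def contains_word(word, sentence):
    """Return True if word occurs in sentence as a whole word."""
    # \b matches at a word boundary, so 'apple' is found in 'apple.' but not in 'apples'
    return re.search(r'\b' + re.escape(word) + r'\b', sentence) is not None


txt = "I like to eat apple. Me too. Let's go buy some apples."
print([t for t in txt.split('. ') if contains_word("apple", t)])
# ['I like to eat apple']
```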