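Python: extract sentences containing 2 words

Time: 2013-08-30 09:11:35

Tags: python regex nltk sentence text-segmentation

I have the same problem as the one discussed in the linked question "Python extract sentence containing word", except that I want to find 2 words in the same sentence. I need to extract the sentences from a corpus that contain 2 specific words. Can anyone help me?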

3 answers:

Answer 0 (score: 2)

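This is straightforward with the TextBlob package and Python's built-in sets.

Basically, iterate over the sentences of the text and check whether the set of words in each sentence contains all of the search words.

from textblob import TextBlob

search_words = set(["buy", "apples"])
blob = TextBlob("I like to eat apple. Me too. Let's go buy some apples.")
matches = []
for sentence in blob.sentences:
    words = set(sentence.words)
    # Subset test: every search word must be in the sentence.
    # (search_words & words would be true if either word is present.)
    if search_words <= words:
        matches.append(str(sentence))
print(matches)
# ["Let's go buy some apples."]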

Update: or, more Pythonically,

from textblob import TextBlob

search_words = set(["buy", "apples"])
blob = TextBlob("I like to eat apple. Me too. Let's go buy some apples.")
# Keep only sentences whose word set contains every search word.
matches = [str(s) for s in blob.sentences if search_words <= set(s.words)]
print(matches)
# ["Let's go buy some apples."]
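
Whether a sentence must contain any of the search words or every one of them comes down to which set operation is used. A minimal sketch with plain Python sets, assuming the sentences have already been split into word lists (the helper names `has_any` and `has_all` and the extra sample sentence are made up for illustration):

```python
def has_any(sentence_words, search_words):
    """True if at least one search word occurs (non-empty intersection)."""
    return bool(set(search_words) & set(sentence_words))


def has_all(sentence_words, search_words):
    """True only if every search word occurs (subset test)."""
    return set(search_words) <= set(sentence_words)


# Hypothetical pre-tokenized sentences.
sentences = [
    ["I", "like", "to", "eat", "apple"],
    ["Me", "too"],
    ["Let", "'s", "go", "buy", "some", "apples"],
    ["I", "buy", "bread"],
]
search = ["buy", "apples"]

any_hits = [s for s in sentences if has_any(s, search)]
all_hits = [s for s in sentences if has_all(s, search)]
print(len(any_hits), len(all_hits))
```

For the question as asked (both words in the same sentence), the subset test is probably the one that fits; the intersection test also accepts sentences such as "I buy bread".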

Answer 1 (score: 1)

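If this is what you mean:

import re

txt = "I like to eat apple. Me too. Let's go buy some apples."
define_words = 'some apple'
# Match one sentence (text up to the next period) that contains the phrase.
# re.escape keeps any regex metacharacters in the phrase literal.
print(re.findall(r"([^.]*?%s[^.]*\.)" % re.escape(define_words), txt))

Output: [" Let's go buy some apples."]

You can also try:

define_words = input("Enter string: ")  # raw_input() on Python 2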

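To check whether a sentence contains all of the defined words:

import re

txt = "I like to eat apple. Me too. Let's go buy some apples."
words = 'go apples'.split(' ')

# Split the text into sentences, each ending at a period.
sentences = re.findall(r"([^.]*\.)", txt)
for sentence in sentences:
    # Substring test: the sentence must contain every word.
    if all(word in sentence for word in words):
        print(sentence)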
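
Note that `word in sentence` is a substring test, so `go` would also match `gone`. If whole-word matching is wanted, one option is a regular expression with `\b` word boundaries. A rough sketch using only the standard library; the function name `sentences_with_all` and the simple split on `.`, `!` and `?` are assumptions for illustration, not a robust sentence segmenter:

```python
import re


def sentences_with_all(text, words):
    """Return the sentences in which every word occurs as a whole word."""
    # Naive segmentation: a run of non-terminators followed by ., ! or ?
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
    # One word-boundary pattern per search word; re.escape keeps it literal.
    patterns = [re.compile(r"\b%s\b" % re.escape(w)) for w in words]
    return [s for s in sentences if all(p.search(s) for p in patterns)]


txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_with_all(txt, ["go", "apples"]))
# ["Let's go buy some apples."]
print(sentences_with_all(txt, ["apple", "like"]))
# ['I like to eat apple.']
```

With word boundaries, `apple` no longer matches `apples`, which may or may not be what you want; matching is also case-sensitive unless `re.IGNORECASE` is passed to `re.compile`.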

Answer 2 (score: 0)

I think you want an answer that uses nltk. And I guess the two words don't need to be consecutive, right?

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> text = "I like to eat apple. Me too. Let's go buy some apples."
>>> words = ['like', 'apple']
>>> sentences = sent_tokenize(text)
>>> for sentence in sentences:
...   if all(word in sentence for word in words):
...      print(sentence)
...
I like to eat apple.