Question

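I have a text document, and I am using regex and nltk to find the 5 most common words in it. I need to print the sentences these words belong to. How can I do that? I would also like to extend this to finding the common words across multiple documents and returning their respective sentences.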

import re

import nltk

# Read the whole document and lowercase it so counting is case-insensitive
with open('test.txt', 'r') as document_text:
    text_string = document_text.read().lower()

# Return all the words with a length in the range [3, 15]
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

fdist = nltk.FreqDist(match_pattern)  # creates a frequency distribution from a list
most_common = fdist.max()             # returns a single element
top_five = fdist.most_common(5)       # returns a list of (word, count) tuples

list_5 = [word for (word, freq) in top_five]

print(top_five)
print(list_5)

Output:

[('you', 8), ('tuples', 8), ('the', 5), ('are', 5), ('pard', 5)]
['you', 'tuples', 'the', 'are', 'pard']

The output is the most common words. I need to print the sentences these words belong to. How can I do that?

Answer 1

Although it does not account for special characters at word boundaries the way your code does, the following is a starting point:

for sentence in text_string.split('.'):
    # A non-empty intersection means the sentence contains a top-5 word
    if set(list_5) & set(sentence.split(' ')):
        print(sentence)

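We first iterate over the sentences, assuming that each sentence ends with a . and that the . character appears nowhere else in the text. Then we print the sentence if the intersection of its set of words with the set of words in list_5 is not empty.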
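As a rough sketch of how the word-boundary limitation mentioned above could be addressed, the version below tokenizes each sentence with the same regex used in the question before taking the set intersection. The function name sentences_with_words and the sample text are made up for illustration, and it still relies on the assumption that sentences end with a . character.

```python
import re

def sentences_with_words(text, words):
    """Return the sentences of `text` that contain any of `words` as a whole word."""
    wanted = set(words)
    matches = []
    # Assumption carried over from the answer: every sentence ends with '.'
    for sentence in text.split('.'):
        # Tokenize with the question's pattern so punctuation does not stick to words
        tokens = set(re.findall(r'\b[a-z]{3,15}\b', sentence.lower()))
        if wanted & tokens:
            matches.append(sentence.strip())
    return matches

# Hypothetical sample text, not taken from the asker's file
sample = "Tuples are immutable. Lists, however, can change. Are you sure?"
print(sentences_with_words(sample, ["tuples", "you"]))
```

Because the comparison is made on whole tokens, a word such as "the" would not match a sentence that only contains "then".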

Answer 2

If you have not installed the NLTK data yet, you will need to.

From http://www.nltk.org/data.html:

Run the Python interpreter and type the commands:

> >>> import nltk
> >>> nltk.download()

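A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory.

Then install the punkt model from the Models tab. Once you have it, you can tokenize the text into sentences and pull out the ones that contain your top 5 words: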

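sent_tokenize_list = nltk.sent_tokenize(text_string)
for sentence in sent_tokenize_list:
    # any() prints each sentence once, even if it contains several top words.
    # Note: this is a substring test, so 'the' also matches 'then'.
    if any(word in sentence for word in list_5):
        print(sentence)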
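Neither answer covers the second part of the question, extending this to several documents. One possible approach, sketched below under stated assumptions, is to count words over all documents together and then record, per document, which sentences contain each top word. The names common_word_sentences, split_sentences and the splitter parameter are hypothetical, and the default splitter is a naive regex stand-in. If the punkt model is installed, nltk.sent_tokenize could be passed as splitter instead.

```python
import re
from collections import Counter

WORD_RE = re.compile(r'\b[a-z]{3,15}\b')  # same word pattern as in the question

def split_sentences(text):
    # Naive stand-in splitter: break after '.', '!' or '?' followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def common_word_sentences(documents, n=5, splitter=split_sentences):
    """documents maps a name to its text.

    Returns (top_words, hits) where hits[name][word] is the list of
    sentences in that document containing `word` as a whole word.
    """
    # Count words across all documents together
    counts = Counter()
    for text in documents.values():
        counts.update(WORD_RE.findall(text.lower()))
    top_words = [word for word, _ in counts.most_common(n)]

    hits = {}
    for name, text in documents.items():
        per_word = {}
        for sentence in splitter(text):
            tokens = set(WORD_RE.findall(sentence.lower()))
            for word in top_words:
                if word in tokens:
                    per_word.setdefault(word, []).append(sentence)
        hits[name] = per_word
    return top_words, hits

# Hypothetical in-memory documents; in practice these would be read from files
docs = {
    "a.txt": "Tuples are immutable. You cannot change tuples.",
    "b.txt": "Lists can change. Tuples cannot!",
}
top_words, hits = common_word_sentences(docs, n=2)
print(top_words)
for name, per_word in hits.items():
    for word, sentences in per_word.items():
        print(name, word, sentences)
```

Counting over the combined text is only one interpretation of "common words in multiple documents". Counting per document and then comparing the lists would be an equally reasonable reading.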

Using Python to print the sentences that contain the most common words in a document
