I want to calculate the frequency of the three words before and after a specific word in a text file that has already been converted to tokens.
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from collections import Counter

with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()

tokens = nltk.word_tokenize(text_data)
text = nltk.Text(tokens)
grams = nltk.ngrams(tokens, 4)
freq = Counter(grams)
freq.most_common(20)
I don't know how to search for the string 'dracula' as a filter word. I have also tried:
text.collocations(num=100)
text.concordance('dracula')
The desired output would look something like this:

Three words before 'dracula', sorted by count:
(('and', 'he', 'saw', 'dracula'), 4),
(('one', 'cannot', 'see', 'dracula'), 2)
Three words after 'dracula', sorted by count:
(('dracula', 'and', 'he', 'saw'), 4),
(('dracula', 'one', 'cannot', 'see'), 2)
Trigrams with 'dracula' in the middle, sorted by count:
(('count', 'dracula', 'saw'), 4),
(('count', 'dracula', 'cannot'), 2)
Thanks in advance for your help.
Answer 0 (score: 0)
Once you have the frequency information in tuple form, you can filter out the word you are looking for with a simple if statement. This uses Python's list comprehension syntax:
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from collections import Counter

with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()

# pulled text from here: https://archive.org/details/draculabr00stokuoft/page/n6
tokens = nltk.word_tokenize(text_data)
text = nltk.Text(tokens)
grams = nltk.ngrams(tokens, 4)
freq = Counter(grams)

# keep only the 4-grams that have 'dracula' in a given position
dracula_last = [item for item in freq.most_common() if item[0][3] == 'dracula']
dracula_first = [item for item in freq.most_common() if item[0][0] == 'dracula']
dracula_second = [item for item in freq.most_common() if item[0][1] == 'dracula']
# etc.
This produces lists with 'dracula' in different positions. Here is what dracula_last looks like:
[(('the', 'castle', 'of', 'dracula'), 3),
(("'s", 'journal', '243', 'dracula'), 1),
(('carpathian', 'moun-', '2', 'dracula'), 1),
(('of', 'the', 'castle', 'dracula'), 1),
(('named', 'by', 'count', 'dracula'), 1),
(('disease', '.', 'count', 'dracula'), 1),
...]
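
The trigram case from the question (with 'dracula' as the middle word) can be handled the same way by generating 3-grams instead of 4-grams and filtering on the middle position. A minimal sketch, assuming the same tokens list and Counter import as above; the variable names here are just for illustration:

trigrams = nltk.ngrams(tokens, 3)
trigram_freq = Counter(trigrams)
# keep only the trigrams whose middle token is 'dracula'; most_common()
# already returns items in descending order of count, so the filtered
# list stays sorted by frequency
dracula_middle = [item for item in trigram_freq.most_common() if item[0][1] == 'dracula']
dracula_middle[:20]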