I have the following code that finds frequently occurring ngrams in a large text file:
import operator
import codecs
import nltk
f = codecs.open('e:/nltk_data/corpora/en/english.txt','r','utf-8')
raw = f.read()
tokens = nltk.word_tokenize(raw)
bgs = nltk.ngrams(tokens, 6)
#compute frequency distribution for all the ngrams in the text
fdist = nltk.FreqDist(bgs)
s = sorted(fdist.items(), key=operator.itemgetter(1), reverse=True)
for i in range(200):
    print(s[i][1], *s[i][0])
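A side note: in NLTK 3, FreqDist is a subclass of collections.Counter, so (if I'm reading the docs right) the sort-and-slice above can be shortened with most_common; a minimal equivalent sketch:

# most_common returns (ngram, count) pairs sorted by descending count
for ngram, count in fdist.most_common(200):
    print(count, ' '.join(ngram))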
Based on the examples below, does anyone know how to write a function that can merge results like:
20 a province in South Africa
30 a province in Greater America
50 a province in Eastern China
into:
100 a province in <thing1> <thing2>
or:
10 America is a country
20 China is a country
into:
30 <thing> is a country
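To make the intent concrete, here is a minimal sketch of the kind of merge I have in mind, assuming all n-grams have the same length n and that we group on a shared token prefix (merge_by_prefix, prefix_len and the <thingN> placeholders are names I made up); the second example would need the mirror-image grouping on a shared suffix:

from collections import defaultdict

def merge_by_prefix(counted_ngrams, prefix_len, n):
    # Sum the counts of all n-grams sharing their first prefix_len tokens,
    # then mask the varying tail with <thing1>, <thing2>, ...
    totals = defaultdict(int)
    for ngram, count in counted_ngrams:
        totals[tuple(ngram[:prefix_len])] += count
    tail = ' '.join('<thing%d>' % (i + 1) for i in range(n - prefix_len))
    merged = [(total, ' '.join(prefix) + (' ' + tail if tail else ''))
              for prefix, total in totals.items()]
    return sorted(merged, reverse=True)

data = [(('a', 'province', 'in', 'South', 'Africa'), 20),
        (('a', 'province', 'in', 'Greater', 'America'), 30),
        (('a', 'province', 'in', 'Eastern', 'China'), 50)]
for total, pattern in merge_by_prefix(data, prefix_len=3, n=5):
    print(total, pattern)  # -> 100 a province in <thing1> <thing2>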
Answer 0 (score: 0)
I don't think this is a complete solution to your problem, but after our exchange in the comments, and after playing with it a bit on my side, I'm posting the code I came up with in case it gives you a starting point.
Since I don't have your input file, I created a simple, short one of my own. I get these 6-grams:
2 and quiet nights at home .
2 quiet nights at home . Seeking
1 dining out , movies , bbqs.
1 , caring woman , slim or
1 an honest , caring woman ,
1 bike riding , TV and DVDs
1 honest lady for friendship to relationship
1 , movies , bbqs. , football
1 with similar interests for friend ship/relationship
1 , country drives and quiet nights
1 . Am honest , caring ,
and many more with only one occurrence.
The following code extracts the 3-grams contained in those 6-grams (only for 6-grams that occur more than once):
inter_grams = {}
for i in range(10):
    # only look inside 6-grams that occur more than once
    if s[i][1] > 1:
        # every 3-gram contained in this 6-gram inherits its count
        for inter_gram in nltk.ngrams(s[i][0], 3):
            if inter_gram in inter_grams:
                inter_grams[inter_gram] += s[i][1]
            else:
                inter_grams[inter_gram] = s[i][1]
print(inter_grams)
{('nights', 'at', 'home'): 4, ('at', 'home', '.'): 4, ('and', 'quiet', 'nights'): 2, ('quiet', 'nights', 'at'): 4, ('home', '.', 'Seeking'): 2}
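From there, a possible next step (just a guess on my side, untested beyond the toy data above) is to sort these inner 3-grams by accumulated count and print them in the merged format from your question, with <thing> standing in for the positions the 3-gram does not cover:

# rank the shared 3-gram "cores" by their accumulated counts
for core, total in sorted(inter_grams.items(),
                          key=lambda kv: kv[1], reverse=True):
    print(total, '<thing>', ' '.join(core), '<thing>')
# e.g. 4 <thing> nights at home <thing>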
Hope it helps.