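I wrote a program that counts the trigrams (three-word sequences) that occur 5 or more times in a text file. The trigrams should be printed in order of frequency.
I can't find the problem!
I get the following error message:
list index out of range
I tried making the range larger, but that didn't help.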
The result should look like this (simplified):
this is a
is a trigram
a trigram which
trigram which is
which is an
is an example
Answer 0 (score: 2)
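As the comments pointed out, you are iterating over the list words with the index i, so when i reaches the last cell, the access words[i+1] is out of range, because index i+1 does not exist in words.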
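The question's loop is not shown here, so the following is only a sketch of the failure and one possible fix, assuming the loop read words[i], words[i+1], and words[i+2] for every index. The function names and the sample words list are hypothetical.

```python
words = "this is a trigram which is an example".split()

def trigrams_buggy(words):
    # Visits every index, so the lookups near the end run past the list.
    result = []
    for i in range(len(words)):
        result.append((words[i], words[i + 1], words[i + 2]))
    return result

def trigrams_fixed(words):
    # Stop two positions early so that i + 2 is always a valid index.
    result = []
    for i in range(len(words) - 2):
        result.append((words[i], words[i + 1], words[i + 2]))
    return result

try:
    trigrams_buggy(words)
except IndexError as err:
    print("buggy version:", err)

print(len(trigrams_fixed(words)), "trigrams, first:", trigrams_fixed(words)[0])
```

A list of k words contains k - 2 trigrams, which is why the fixed loop uses range(len(words) - 2). Making the range larger, as tried in the question, moves the index further past the end instead of keeping it inside the list.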
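I suggest reading this tutorial on generating n-grams in pure Python: http://www.albertauyeung.com/post/generating-ngrams-python/
If you don't have time to read the whole thing, here is a function I would suggest, adapted from the link: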
def get_ngrams_count(words, n):
    # lazily yields tuples representing all n-grams
    ngrams_tuple = zip(*[words[i:] for i in range(n)])
    # turn the tuples into a dictionary with the counts of all n-grams
    ngrams_count = {}
    for ngram in ngrams_tuple:
        if ngram not in ngrams_count:
            ngrams_count[ngram] = 0
        ngrams_count[ngram] += 1
    return ngrams_count

trigrams = get_ngrams_count(words, 3)
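To see what the zip call is doing, here is a small sketch run on the sample sentence from the question. The variable names shifted and ngrams are only for illustration.

```python
words = "this is a trigram which is an example".split()
n = 3

# Build n copies of the list, each starting one position later:
# shifted[0] starts at "this", shifted[1] at "is", shifted[2] at "a".
shifted = [words[i:] for i in range(n)]

# zip stops at the shortest copy, so no lookup ever runs past the end.
ngrams = list(zip(*shifted))

for ngram in ngrams:
    print(" ".join(ngram))
```

Because zip truncates to the shortest input, this approach avoids the explicit index arithmetic that caused the original error.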
Note that you can simplify this function with Counter (which is a subclass of dict, so it remains compatible with your code):
from collections import Counter

def get_ngrams_count(words, n):
    # turn the n-gram tuples into a dictionary with the counts of all n-grams
    return Counter(zip(*[words[i:] for i in range(n)]))

trigrams = get_ngrams_count(words, 3)
You can sort the list from most common to least common by passing the boolean parameter reverse to .sort():
l = list(trigrams.items())
l.sort(key=lambda x: x[1], reverse=True)
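This is slightly faster than sorting the list in ascending order and then reversing it with .reverse().
Because the list is now in descending order, the printing loop can stop at the first count below 5: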
for ngram, count in l:
    if count < 5:
        break
    # " ".join(ngram) combines all elements of ngram into one string, separated by spaces
    print(" ".join(ngram), count)
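As a possible alternative to sorting by hand, Counter also provides most_common(), which returns the (item, count) pairs ordered from most to least frequent. The sketch below puts the pieces together. The frequent_ngrams name, the min_count parameter, and the sample text are made up for illustration, and it assumes the words come from splitting the text on whitespace.

```python
from collections import Counter

def frequent_ngrams(words, n, min_count):
    """Return (ngram, count) pairs with count >= min_count, most frequent first."""
    counts = Counter(zip(*[words[i:] for i in range(n)]))
    result = []
    for ngram, count in counts.most_common():
        if count < min_count:
            break  # counts only decrease from here on
        result.append((" ".join(ngram), count))
    return result

# Hypothetical input; in the question the words would come from a text file.
text = "a b c " * 5 + "x y z"
words = text.split()

for ngram, count in frequent_ngrams(words, 3, 5):
    print(ngram, count)
```

In this sample only the trigram "a b c" reaches the threshold of 5, so it is the only line printed.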