Question

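I wrote a program that counts the trigrams (three-word sequences) that occur 5 or more times in a text file. The trigrams should be printed in order of frequency.

I can't find the problem!

I get the following error message:

list index out of range

I tried widening the range, but that didn't work.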

The result should look like this (simplified):

this is a
is a trigram
a trigram which
trigram which is
which is an
is an example

Answer 1

As the comments point out, you are iterating over the list words with the index i, and when i reaches the last cell of words, accessing words[i+1] goes out of range.
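The question's code is not shown here, so the following is only a hypothetical reconstruction of the failing pattern, next to one way to keep the index in range. The function names trigrams_unsafe and trigrams_safe and the sample sentence are made up for illustration.

```python
words = "this is a trigram which is an example".split()


def trigrams_unsafe(words):
    # Hypothetical failing pattern: i runs over every index, so
    # words[i + 1] and words[i + 2] run past the end of the list.
    return [(words[i], words[i + 1], words[i + 2]) for i in range(len(words))]


def trigrams_safe(words):
    # Stopping two positions before the end keeps every index valid.
    return [(words[i], words[i + 1], words[i + 2]) for i in range(len(words) - 2)]


try:
    trigrams_unsafe(words)
except IndexError as exc:
    print("IndexError:", exc)

print(trigrams_safe(words)[:2])
```

A list of 8 words yields 6 trigrams, which is why the loop has to stop at len(words) - 2.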

I suggest reading this tutorial on generating n-grams in pure Python: http://www.albertauyeung.com/post/generating-ngrams-python/

If you don't have time to read all of it, here is the function I suggest, adapted from the link:

def get_ngrams_count(words, n):
    # build an iterator of tuples, one tuple per n-gram
    ngrams_tuple = zip(*[words[i:] for i in range(n)])
    # turn it into a dictionary mapping each n-gram to its count
    ngrams_count = {}
    for ngram in ngrams_tuple:
        if ngram not in ngrams_count:
            ngrams_count[ngram] = 0
        ngrams_count[ngram] += 1
    return ngrams_count

trigrams = get_ngrams_count(words, 3)
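To see why the zip(*[words[i:] for i in range(n)]) idiom produces n-grams, it may help to look at the shifted copies it zips together. This is a small sketch with a made-up word list. The variable shifted is introduced only for illustration.

```python
words = ["this", "is", "a", "trigram", "which"]
n = 3

# One copy of the list per offset: words, words[1:], words[2:]
shifted = [words[i:] for i in range(n)]

# zip pairs up the elements at the same position in each copy and
# stops at the shortest copy, so no index ever goes out of range.
ngrams = list(zip(*shifted))
print(ngrams)
# [('this', 'is', 'a'), ('is', 'a', 'trigram'), ('a', 'trigram', 'which')]
```

Because zip stops at the shortest input, a list with fewer than n words simply produces no n-grams instead of raising an error.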

Note that you can simplify this function with Counter (a subclass of dict, so it stays compatible with your code):

from collections import Counter


def get_ngrams_count(words, n):
    # count every n-gram; Counter behaves like a dictionary of counts
    return Counter(zip(*[words[i:] for i in range(n)]))

trigrams = get_ngrams_count(words, 3)

Notes

You can sort the list from most common to least common by passing the boolean parameter reverse to .sort():

l = list(trigrams.items())
l.sort(key=lambda x: x[1], reverse=True)

This is slightly faster than sorting the list in ascending order and then reversing it with .reverse().
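If trigrams is built with the Counter version of the function, its most_common() method already returns the (n-gram, count) pairs ordered from most to least frequent, so it can stand in for the manual sort. A minimal sketch with a made-up word list:

```python
from collections import Counter

words = "a b c a b c a b c d".split()

# Same counting approach as get_ngrams_count(words, 3) with Counter
trigrams = Counter(zip(*[words[i:] for i in range(3)]))

# most_common() returns a list of (ngram, count) pairs, highest count first
for ngram, count in trigrams.most_common():
    print(" ".join(ngram), count)
```

Passing a number, as in trigrams.most_common(10), limits the result to that many entries.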

A more general way to print the sorted list (it works for any n-gram, not just trigrams):

for ngram, count in l:
    if count < 5:
        break
    # " ".join(ngram) will combine all elements of ngram in a string, separated with spaces
    print(" ".join(ngram), count)
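Putting the pieces together, the whole task could be sketched roughly as follows. The function name frequent_ngrams, the lowercase whitespace tokenization, and the sample text are assumptions made for illustration. The question does not say how the file is split into words.

```python
from collections import Counter


def frequent_ngrams(text, n=3, min_count=5):
    # Assumption: words are separated by whitespace and case is ignored.
    words = text.lower().split()
    counts = Counter(zip(*[words[i:] for i in range(n)]))
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Keep only the n-grams that occur at least min_count times.
    return [(" ".join(ngram), count) for ngram, count in ordered if count >= min_count]


# Made-up sample text; for a real file, read its contents into a string first.
sample = "this is a test " * 5
for line, count in frequent_ngrams(sample):
    print(line, count)
```

Here the threshold is applied with a filter rather than a break, which gives the same result on a list sorted in descending order.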
