Question

I want to build a matrix for a bigram model. How can I do that? Any suggestions that fit my code?

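 import codecs
 from collections import Counter

 import nltk

 # Collect the tokens of every line; plain assignment inside the loop
 # would keep only the tokens of the last line.
 token = []
 with codecs.open("Pezeshki339.txt", 'r', 'utf8') as file:
     for line in file:
         token.extend(line.split())

 spl = 80 * len(token) // 100  # index of the 80/20 train/test split
 train = token[:spl]
 test = token[spl:]
 print(len(test))
 print(len(train))
 cn = Counter(train)
 # Remove the rare words (seen only once) but keep the remaining tokens in
 # their original order, so that the bigram counts stay meaningful.
 known_words = [word for word in train if cn[word] > 1]

 bigram = nltk.bigrams(known_words)
 frequency = nltk.FreqDist(bigram)
 for f in frequency:
     print(f, frequency[f])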

I need something like this:

          w1        w2      w3          ....wn
 w1     n(w1w1)  n(w1w2)  n(w1w3)      n(w1wn)
 w2     n(w2w1)  n(w2w2)  n(w2w3)      n(w2wn)
 w3   .
  .
  .
  .
  wn

The rows and the columns are labeled with the same words.

Answer 1

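Since you need a "matrix" of words, you'll use a dictionary-like class. You want a dictionary whose keys are all the first words of the bigrams. To make it a two-dimensional matrix, it will be a dictionary of dictionaries: each value is another dictionary, whose keys are the second words of the bigrams and whose values are whatever you're tracking (probably the number of occurrences).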
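As a minimal sketch of that dictionary-of-dictionaries idea, the standard library is enough. This is not how NLTK implements it; the toy token list and the names `tokens` and `table` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy token list, invented for illustration; with the question's code
# this would be the training tokens instead.
tokens = "the cat sat on the mat the cat ran".split()

# Outer key: first word of the bigram. Inner Counter: second word -> count.
table = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    table[first][second] += 1

print(table["the"]["cat"])  # n(the cat)
print(table["cat"]["the"])  # a pair that never occurs counts as 0
```

Using a `Counter` for the inner dictionary means a lookup for an unseen second word returns 0 instead of raising `KeyError`.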

In NLTK, the quick way to do this is with ConditionalFreqDist():

import nltk
from nltk.corpus import brown  # needs the Brown corpus: nltk.download('brown')

mybigrams = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

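But I recommend that you build your bigram table step by step. You'll understand it better, and you need to understand it before you can use it.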
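One possible way to go step by step, sketched with the standard library only: count the adjacent pairs, fix a single word order, then fill a square table whose rows and columns share that order. The function name `bigram_matrix` and the sample tokens are hypothetical; with the question's code you would pass the training tokens instead.

```python
from collections import Counter


def bigram_matrix(tokens):
    """Return (vocab, matrix) where matrix[i][j] = n(vocab[i] vocab[j])."""
    # Step 1: count every adjacent pair of tokens.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Step 2: fix one word order, shared by rows and columns.
    vocab = sorted(set(tokens))
    # Step 3: fill the square table; unseen pairs stay at 0.
    matrix = [[pair_counts[(w1, w2)] for w2 in vocab] for w1 in vocab]
    return vocab, matrix


vocab, matrix = bigram_matrix("a b a b c a".split())

# Print the table with a header row, in the layout asked for in the question.
width = max(len(w) for w in vocab) + 2
print(" " * width + "".join(w.rjust(width) for w in vocab))
for word, row in zip(vocab, matrix):
    print(word.ljust(width) + "".join(str(n).rjust(width) for n in row))
```

For a large vocabulary a dense n-by-n table is mostly zeros, so the dictionary-of-dictionaries form, which stores only the pairs that actually occur, is usually the more economical choice.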
