I have computed a co-occurrence matrix for a given corpus and a list of keywords, but the result is incorrect.
The code is:
import numpy as np
import pandas as pd

Courpus = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]
top_words = ["abc", "pqr", "def"]

# Start from an all-zero matrix labeled by the keywords.
m = np.zeros([3, 3])
cooccurrence_matrix = pd.DataFrame(m, index=top_words, columns=top_words)

for sent in Courpus:
    word = sent.split(" ")
    for i, d in enumerate(word):
        # Look at the words around position i.
        for j in range(max(i - 2, 0), min(i + 2, len(word))):
            try:
                if word[i] != word[j]:
                    cooccurrence_matrix.loc[word[i], word[j]] += 1
            except KeyError:
                # Skip pairs where either word is not a keyword.
                pass

print(cooccurrence_matrix)
Output:
      abc  pqr  def
abc   0.0  2.0  3.0
pqr   2.0  0.0  2.0
def   2.0  1.0  0.0
Expected Output:
      abc  pqr  def
abc:    0    3    3
pqr:    3    0    2
def:    3    2    0
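
A possible explanation, offered as a sketch rather than a definitive fix: the expected counts are consistent with a window of two words on each side, whereas `range(max(i - 2, 0), min(i + 2, len(word)))` stops at `i + 1`, because `range` excludes its stop value. The helper `cooccurrence_counts` below is hypothetical (its name and the `window` parameter are not from the code above). It uses only the standard library and assumes that symmetric window:

```python
def cooccurrence_counts(sentences, vocab, window=2):
    """Count how often each vocab word has another vocab word
    within `window` positions on either side (assumed symmetric)."""
    counts = {w: {v: 0 for v in vocab} for w in vocab}
    for sentence in sentences:
        words = sentence.split()
        for i, center in enumerate(words):
            if center not in counts:
                continue  # only keywords get a row
            lo = max(i - window, 0)
            # range() excludes its stop value, so add 1 to reach i + window.
            hi = min(i + window + 1, len(words))
            for j in range(lo, hi):
                neighbor = words[j]
                if neighbor != center and neighbor in counts:
                    counts[center][neighbor] += 1
    return counts


corpus = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]
top_words = ["abc", "pqr", "def"]

counts = cooccurrence_counts(corpus, top_words)
for w in top_words:
    print(w, [counts[w][v] for v in top_words])
# abc [0, 3, 3]
# pqr [3, 0, 2]
# def [3, 2, 0]
```

With the same assumption, changing only the upper bound in the original loop to `min(i + 3, len(word))` should give the same counts. The explicit membership checks replace the `try`/`except`, so unrelated errors are not silently swallowed.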