I want to count co-occurrences and build a co-occurrence matrix using the following code: https://stackoverflow.com/a/42814963.
from collections import OrderedDict
document = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]
names = ['A', 'B', 'C', 'D']
occurrences = OrderedDict((name, OrderedDict((name, 0) for name in names)) for name in names)
# Find the co-occurrences:
for l in document:
    for i in range(len(l)):
        for item in l[:i] + l[i + 1:]:
            occurrences[l[i]][item] += 1
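To make the expected result concrete, here is a minimal sketch that runs the same counting loop on the toy data and loads the nested dictionary into a pandas.DataFrame. The function name `count_cooccurrences` is an assumption made for illustration and is not part of the linked answer.

```python
from collections import OrderedDict

import pandas


def count_cooccurrences(documents, names):
    """Count how often each pair of names appears in the same document."""
    # One inner counter per name; outer keys become columns, inner keys rows.
    occurrences = OrderedDict(
        (name, OrderedDict((other, 0) for other in names)) for name in names
    )
    for doc in documents:
        for pos, word in enumerate(doc):
            # Every other item in the same document co-occurs with `word`.
            for item in doc[:pos] + doc[pos + 1:]:
                occurrences[word][item] += 1
    return occurrences


document = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]
names = ['A', 'B', 'C', 'D']
matrix = pandas.DataFrame(count_cooccurrences(document, names))
print(matrix)
```

The matrix is symmetric with zeros on the diagonal; for example, 'A' and 'B' share two documents, so both `matrix.loc['A', 'B']` and `matrix.loc['B', 'A']` are 2.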
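Unfortunately, my dataset consists of more than 2 million documents. I therefore want to split the analysis into several chunks and combine the data after each chunk via pandas.DataFrame. I tried this with the following code: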
k=3000
df_final = pandas.DataFrame()
for i in range(0, (len(vocabulary)//k)+k):
    vocabulary_chunk = []
    vocabulary_chunk = vocabulary[(k*i):((i+1)*k)]
    occurrences = OrderedDict((name, OrderedDict((name, 0) for name in vocabulary)) for name in vocabulary_chunk)
    # Find the co-occurrences:
    for l in term_constellation:
        for i in range(len(l)):
            for item in l[:i] + l[i + 1:]:
                occurrences[l[i]][item] += 1
    df = pandas.DataFrame(data=occurrences)
    df_final = df_final + df
pandas.DataFrame(data=df_final).to_csv("Output.csv", sep=";", decimal=',')
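One way the chunking idea might be made to work is sketched below; it is an illustration under stated assumptions, not a tested solution for the real dataset. The function name `chunked_cooccurrence` and its parameters are hypothetical. Compared with the code above, the sketch steps through the vocabulary with `range(0, len(vocabulary), chunk_size)`, uses separate loop variable names for the chunk and the position inside a document, skips words that are not in the current chunk, and joins the per-chunk frames column-wise with `pandas.concat` instead of adding them.

```python
from collections import Counter

import pandas


def chunked_cooccurrence(documents, vocabulary, chunk_size):
    """Build the co-occurrence matrix one block of columns at a time."""
    vocab_set = set(vocabulary)
    frames = []
    for start in range(0, len(vocabulary), chunk_size):
        chunk = vocabulary[start:start + chunk_size]
        chunk_set = set(chunk)
        counts = Counter()
        for doc in documents:
            for pos, word in enumerate(doc):
                if word not in chunk_set:
                    continue  # this column belongs to another chunk
                for other in doc[:pos] + doc[pos + 1:]:
                    if other in vocab_set:
                        counts[(other, word)] += 1
        # Rows cover the whole vocabulary, columns only the current chunk.
        frame = pandas.DataFrame(0, index=vocabulary, columns=chunk)
        for (row, col), n in counts.items():
            frame.at[row, col] = n
        frames.append(frame)
    # Chunks hold disjoint columns, so they are joined side by side.
    return pandas.concat(frames, axis=1)


documents = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]
vocabulary = ['A', 'B', 'C', 'D']
result = chunked_cooccurrence(documents, vocabulary, chunk_size=2)
print(result)
```

Each chunk still scans every document, and each frame is dense, so for a large vocabulary this trades run time for a smaller working set per chunk.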
By my reasoning this should work. Unfortunately, I get the following error:
occurrences[l[i]][item] += 1
KeyError: 'factor'
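A possible reading of this traceback, shown as a minimal sketch: the outer dictionary is keyed only by the words of the current chunk, while the documents can contain any word, so a lookup for a word outside the chunk fails. The three-word vocabulary below is made up for illustration; only 'factor' comes from the error message. The same error would also appear if a word were missing from the vocabulary altogether.

```python
from collections import OrderedDict

vocabulary = ['alpha', 'beta', 'factor']  # hypothetical vocabulary
vocabulary_chunk = vocabulary[:2]         # 'factor' is outside this chunk

# Outer keys: chunk only. Inner keys: the whole vocabulary.
occurrences = OrderedDict(
    (name, OrderedDict((other, 0) for other in vocabulary))
    for name in vocabulary_chunk
)

doc = ['factor', 'alpha']
try:
    occurrences[doc[0]][doc[1]] += 1
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]

print(missing_key)
```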
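When I apply the code to fewer documents, no error occurs, but the output file contains only the row and column names and no values.
What is my code missing?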
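On the empty output: a small sketch suggesting where the values may be lost. Adding a frame to an empty pandas.DataFrame with `+` aligns both on the union of their labels and yields NaN wherever one side has no entry, and `to_csv` writes NaN as an empty field by default. The two-by-two frame is made-up example data; `DataFrame.add` with `fill_value=0` is one way to treat the missing side as zero.

```python
import pandas

# Made-up counts standing in for one chunk's result.
df = pandas.DataFrame({'A': [0, 2], 'B': [2, 0]}, index=['A', 'B'])

# Labels survive the addition, but every cell becomes NaN.
total = pandas.DataFrame() + df
all_missing = bool(total.isna().all().all())
csv_text = total.to_csv(sep=';')
print(csv_text)

# Treating the missing side as 0 keeps the counts.
fixed = pandas.DataFrame().add(df, fill_value=0)
print(fixed)
```

The same NaN effect occurs when two non-empty frames with different column labels are added, which is the case when each chunk contributes its own set of columns.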