How can I optimize co-occurrence matrix code so that it scales?

Asked: 2019-06-11 12:24:39

Tags: python-3.x csv n-gram find-occurrences

I want to count co-occurrences and build a co-occurrence matrix using the following code: https://stackoverflow.com/a/42814963

from collections import OrderedDict

document = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]
names = ['A', 'B', 'C', 'D']

occurrences = OrderedDict((name, OrderedDict((name, 0) for name in names)) for name in names)

# Find the co-occurrences:
for l in document:
    for i in range(len(l)):
        for item in l[:i] + l[i + 1:]:
            occurrences[l[i]][item] += 1
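
For reference, the same counts can be kept sparsely, so that only pairs that actually occur take up memory. This is only a sketch, not the approach from the linked answer: `pair_counts` is a name made up here, and `itertools.permutations` is used because it yields every ordered pair of distinct positions, which is what the nested loop above visits.

```python
from collections import Counter
from itertools import permutations

document = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]

# Sparse alternative to the nested OrderedDict: one counter keyed by (term, other).
pair_counts = Counter()
for terms in document:
    # Every ordered pair of two different positions in the same document.
    pair_counts.update(permutations(terms, 2))
```

Pairs that never occur are simply absent, and looking one up returns 0.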

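Unfortunately, my data set consists of more than 2 million documents. I therefore want to split the analysis into chunks and combine the data via pandas.DataFrame after each chunk. I tried this with the following code: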

k=3000
df_final = pandas.DataFrame()
for i in range(0, (len(vocabulary)//k)+k):
    vocabulary_chunk = []
    vocabulary_chunk = vocabulary[(k*i):((i+1)*k)] 

    occurrences = OrderedDict((name, OrderedDict((name, 0) for name in vocabulary)) for name in vocabulary_chunk)

    # Find the co-occurrences:
    for l in term_constellation:
        for i in range(len(l)):
            for item in l[:i] + l[i + 1:]:
                occurrences[l[i]][item] += 1

    df = pandas.DataFrame(data=occurrences)
    df_final = df_final + df


pandas.DataFrame(data=df_final).to_csv("Output.csv", sep=";", decimal=',')

As far as I can tell, this should work. Unfortunately, I get the following error:

occurrences[l[i]][item] += 1

KeyError: 'factor'
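
A likely cause, assuming 'factor' is a term that does not belong to the current vocabulary chunk: `occurrences` only has outer keys for `vocabulary_chunk`, while the loop indexes it with every term `l[i]` of every document. In addition, the inner `for i` reuses the chunk counter `i`, and `range(0, (len(vocabulary)//k)+k)` runs far past the last chunk. The sketch below skips terms outside the chunk and steps through the vocabulary by slice. The function name `cooccurrence_by_chunk` and its parameters are made up for illustration.

```python
import pandas as pd


def cooccurrence_by_chunk(documents, vocabulary, chunk_size):
    """Count co-occurrences one vocabulary chunk at a time (illustrative sketch)."""
    frames = []
    for start in range(0, len(vocabulary), chunk_size):
        chunk = vocabulary[start:start + chunk_size]
        # Outer keys: terms of this chunk only. Inner keys: the full vocabulary.
        occurrences = {name: dict.fromkeys(vocabulary, 0) for name in chunk}
        for doc in documents:
            for pos, term in enumerate(doc):
                counts = occurrences.get(term)
                if counts is None:
                    continue  # term is handled by another chunk (or is unknown)
                for other in doc[:pos] + doc[pos + 1:]:
                    if other in counts:  # ignore terms missing from the vocabulary
                        counts[other] += 1
        # One column per chunk term, one row per vocabulary term.
        frames.append(pd.DataFrame(occurrences, index=vocabulary))
    # The chunks hold different columns, so they are joined side by side.
    return pd.concat(frames, axis=1)


document = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]
names = ['A', 'B', 'C', 'D']
matrix = cooccurrence_by_chunk(document, names, chunk_size=3)
```

This still makes one pass over all documents per chunk, so it trades run time for memory.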

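I applied the code to a smaller number of documents and no error occurred, but the output file contains only the row and column names and no values.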
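
The empty output is probably explained by `df_final = df_final + df`. DataFrame addition aligns on both row and column labels, and any label present in only one operand yields NaN. Because `df_final` starts empty and every chunk has different columns, every cell ends up NaN, which `to_csv` writes as an empty field by default. A small sketch with made-up frames `chunk_a` and `chunk_b`:

```python
import pandas as pd

# Two chunks with the same rows but different columns, as in the question.
chunk_a = pd.DataFrame({"A": [0, 2]}, index=["A", "B"])
chunk_b = pd.DataFrame({"B": [2, 0]}, index=["A", "B"])

# Adding to an empty frame aligns the labels and leaves every cell NaN.
summed = pd.DataFrame() + chunk_a

# Concatenating along the columns keeps the counted values.
joined = pd.concat([chunk_a, chunk_b], axis=1)
```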

What is my code missing?

0 answers:

No answers yet