Question

I have a DataFrame of features, indexed by ID.

ID1, Red, Green, Blue
ID2, Yellow, Green, Orange
ID3, Gray, Green, Yellow
ID4, Yellow, Green, Blue

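I am trying to generate an edge list with cosine similarity as the weight, without first building an adjacency matrix.

I have plenty of compute time, but memory is limited and the dataset is large.

What I need is the following, excluding pairs whose weight is 0: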

ID1 ID2 Weight (cosine similarity)
01 02 0.33
01 03 0.25
01 04 0.75

(for illustration only)

Here is how I solved the problem by going through an adjacency matrix.

import pandas as pd
import numpy as np 
from sklearn.metrics.pairwise import cosine_similarity

df = df.pivot_table(index = ('ID'), columns= 'color', aggfunc=len, fill_value=0)
# as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
matrix = df.to_numpy().astype(np.float32)
matrix = cosine_similarity(matrix)

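Using combinations I can generate the list of pairs, but I don't know how to apply cosine_similarity so that zero weights are left out and memory does not fill up.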

from itertools import combinations

edge_list = pd.DataFrame(list(combinations(df.index.tolist(), 2)), columns=['Source', 'Target'])

Any input is appreciated. Thanks,

Answer 1

Here is a very simple for-loop approach:

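from itertools import combinations

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Join each row's colors into one string and vectorize it into a sparse count matrix
vect = CountVectorizer()
X = vect.fit_transform(df.add(' ').sum(1))

# Compute the similarity one pair at a time, so no n x n matrix is ever built
data = []
for i1, i2 in combinations(df.index.tolist(), 2):
    data.append([i1, i2,
                 cosine_similarity(X[df.index.get_loc(i1)],
                                   X[df.index.get_loc(i2)]).ravel()[0]])
data = pd.DataFrame(data, columns=['Source','Target','Weight'])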
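If the pair-by-pair loop above turns out to be too slow, one possible variation is to work on blocks of rows: normalize the sparse rows once, multiply one block at a time against the whole matrix, and keep only the stored (nonzero) entries above the diagonal. This is only a sketch under some assumptions. The names iter_edges and chunk_size are made up for illustration, and X is typed in from the vectorized array shown in the results so that the snippet needs only NumPy, SciPy and pandas; in practice it would be the X returned by CountVectorizer.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Rows copied from the vectorized array in the results;
# in practice X would come from CountVectorizer.
X = sparse.csr_matrix(np.array([[1, 0, 1, 0, 1, 0],
                                [0, 0, 1, 1, 0, 1],
                                [0, 1, 1, 0, 0, 1],
                                [1, 0, 1, 0, 0, 1]]))
ids = ["ID1", "ID2", "ID3", "ID4"]


def iter_edges(X, ids, chunk_size=1000):
    """Yield (source, target, weight) for every pair with nonzero cosine similarity.

    iter_edges and chunk_size are hypothetical names, not part of the original answer.
    """
    X = sparse.csr_matrix(X, dtype=np.float64)
    # Scale each row to unit length so a plain dot product equals the cosine.
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    inv = np.divide(1.0, norms, out=np.zeros_like(norms), where=norms > 0)
    Xn = (sparse.diags(inv) @ X).tocsr()
    XnT = Xn.T.tocsr()
    n = Xn.shape[0]
    for start in range(0, n, chunk_size):
        # Sparse (chunk x n) block; pairs with no shared feature are never stored.
        block = (Xn[start:start + chunk_size] @ XnT).tocoo()
        for r, c, w in zip(block.row, block.col, block.data):
            i = start + int(r)
            if i < c and w != 0:  # upper triangle only: each pair once, no self-pairs
                yield ids[i], ids[int(c)], float(w)


edges = pd.DataFrame(list(iter_edges(X, ids, chunk_size=2)),
                     columns=["Source", "Target", "Weight"])
edges = edges.sort_values(["Source", "Target"]).reset_index(drop=True)
print(edges)
```

Peak memory is then bounded by one block of at most chunk_size x n similarities rather than the full n x n matrix. On a large dataset the edges would probably be written to disk block by block instead of being collected into a single DataFrame as done here. Note that when nearly every pair shares a feature (as with Green in the sample data), each block is close to dense, so chunk_size has to be chosen accordingly.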

Result:

The vectorized source DF:

In [280]: X
Out[280]:
<4x6 sparse matrix of type '<class 'numpy.int64'>'
        with 12 stored elements in Compressed Sparse Row format>

In [281]: X.A
Out[281]:
array([[1, 0, 1, 0, 1, 0],
       [0, 0, 1, 1, 0, 1],
       [0, 1, 1, 0, 0, 1],
       [1, 0, 1, 0, 0, 1]], dtype=int64)

Represented as a sparse DF:

In [282]: pd.SparseDataFrame(X, columns=vect.get_feature_names(), default_fill_value=0)
Out[282]:
   blue  gray  green  orange  red  yellow
0     1     0      1       0    1       0
1     0     0      1       1    0       1
2     0     1      1       0    0       1
3     1     0      1       0    0       1
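A note for recent library versions: pd.SparseDataFrame was removed in pandas 1.0, and CountVectorizer.get_feature_names was removed in scikit-learn 1.2 in favor of get_feature_names_out. A roughly equivalent sparse frame can probably be built as sketched below. The column list is hard-coded here as a stand-in for vect.get_feature_names_out(), and X is typed in from the array above, so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Same values as the vectorized array shown above.
X = sparse.csr_matrix(np.array([[1, 0, 1, 0, 1, 0],
                                [0, 0, 1, 1, 0, 1],
                                [0, 1, 1, 0, 0, 1],
                                [1, 0, 1, 0, 0, 1]]))

# Stand-in for vect.get_feature_names_out() (assumption: scikit-learn >= 1.0).
columns = ["blue", "gray", "green", "orange", "red", "yellow"]

# Every column gets a sparse dtype; the zeros are not stored.
sdf = pd.DataFrame.sparse.from_spmatrix(X, columns=columns)
print(sdf)
```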

Resulting DF:

In [283]: data
Out[283]:
  Source Target    Weight
0    ID1    ID2  0.333333
1    ID1    ID3  0.333333
2    ID1    ID4  0.666667
3    ID2    ID3  0.666667
4    ID2    ID4  0.666667
5    ID3    ID4  0.666667
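As a sanity check on these weights, a single pair can be recomputed by hand from the rows of the vectorized array. The small sketch below does this for ID1 and ID4, which share two of their three colors (blue and green); the variable names are just for illustration.

```python
import numpy as np

# Rows 0 and 3 of the vectorized array (ID1 and ID4).
id1 = np.array([1, 0, 1, 0, 1, 0])
id4 = np.array([1, 0, 1, 0, 0, 1])

# Cosine similarity = dot product divided by the product of the vector lengths.
weight = id1 @ id4 / (np.linalg.norm(id1) * np.linalg.norm(id4))
print(round(weight, 6))  # 0.666667
```

Each row has exactly three ones, so every weight here is simply (number of shared colors) / 3.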

Memory-efficient way to generate a weighted edge list
