How to convert a DataFrame with string columns to a csr_matrix

Asked: 2017-05-15 05:48:48

Tags: python csr

I am working on a PMI problem, and so far I have a DataFrame like this:

import pandas as pd

w = ['by', 'step', 'by', 'the', 'is', 'step', 'is', 'by', 'is']
c = ['step', 'what', 'is', 'what', 'the', 'the', 'step', 'the', 'what']
ppmi = [1, 3, 12, 3, 123, 1, 321, 1, 23]
df = pd.DataFrame({'w': w, 'c': c, 'ppmi': ppmi})

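I want to convert this DataFrame to a sparse matrix. Because w and c are lists of strings, calling csr_matrix((ppmi, (w, c))) raises TypeError: cannot perform reduce with flexible type. What is another way to convert this DataFrame?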

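1 Answer:

Answer 0 (score: 0)

You could try coo_matrix, using the integer codes of a MultiIndex built from w and c as the row and column indices: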

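import pandas as pd
import scipy.sparse as sps

w = ['by', 'step', 'by', 'the', 'is', 'step', 'is', 'by', 'is']
c = ['step', 'what', 'is', 'what', 'the', 'the', 'step', 'the', 'what']
ppmi = [1, 3, 12, 3, 123, 1, 321, 1, 23]
df = pd.DataFrame({'w': w, 'c': c, 'ppmi': ppmi})
# Move the string columns into a MultiIndex; each level gets integer codes.
df.set_index(['w', 'c'], inplace=True)
# MultiIndex.codes replaces .labels, which was deprecated in pandas 0.24 and removed in 1.0.
mat = sps.coo_matrix((df['ppmi'], (df.index.codes[0], df.index.codes[1])))
# Call mat.tocsr() if a csr_matrix is required.
print(mat.todense())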

Output:

[[ 12   1   1   0]
 [  0 321 123  23]
 [  0   0   1   3]
 [  0   0   0   3]]
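
In this output the rows follow the sorted w values (by, is, step, the) and the columns follow the sorted c values (is, step, the, what). As an alternative that builds a csr_matrix directly and keeps the label-to-index mapping available, here is a minimal sketch using pd.factorize. It is not part of the original answer, and the names row_codes, row_labels, col_codes, and col_labels are chosen here purely for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix

w = ['by', 'step', 'by', 'the', 'is', 'step', 'is', 'by', 'is']
c = ['step', 'what', 'is', 'what', 'the', 'the', 'step', 'the', 'what']
ppmi = [1, 3, 12, 3, 123, 1, 321, 1, 23]
df = pd.DataFrame({'w': w, 'c': c, 'ppmi': ppmi})

# pd.factorize maps each distinct string to an integer code and returns
# the distinct labels; sort=True orders the labels alphabetically.
row_codes, row_labels = pd.factorize(df['w'], sort=True)
col_codes, col_labels = pd.factorize(df['c'], sort=True)

# Build the CSR matrix from (data, (row indices, column indices)).
mat = csr_matrix(
    (df['ppmi'].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
)

# Look up a value by its original string labels.
i = row_labels.get_loc('is')
j = col_labels.get_loc('step')
print(mat[i, j])  # 321
```

Note that if the same (w, c) pair appeared more than once, csr_matrix would sum the corresponding ppmi values, so it may be worth deduplicating the DataFrame first if that is not the intended behavior.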