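I want to create a scipy.sparse.coo_matrix
from a very large pandas.DataFrame object (example below),
which holds the column positions and the values to fill in. The values, however, are stored as lists.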
import numpy as np
import pandas as pd
from scipy import sparse
df = pd.DataFrame({'val':[[1,2,4,0,0,0],[3,0,9,0,0,12],[0,0,0,18,0,0]],'col':[0,2,3]})
>OUT: col val
0 0 [1, 2, 4, 0, 0, 0]
1 2 [3, 0, 9, 0, 0, 12]
2 3 [0, 0, 0, 18, 0, 0]
One approach I found is to put everything into an np.array
and build index arrays for the rows and columns.
value = df['val'].values
value = np.concatenate(value) # dissolve list structure
>OUT:[ 1 2 4 0 0 0 3 0 9 0 0 12 0 0 0 18 0 0]
col_position = df['col'].values
col_position = np.repeat(col_position,6)
>OUT:[0 0 0 0 0 0 2 2 2 2 2 2 3 3 3 3 3 3]
row_position = np.tile(np.arange(6),3) #to have 6 rows
>OUT:[0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5]
result = sparse.coo_matrix((value,(row_position, col_position)), shape=(6, 4))
print(result.todense())
>OUT: [[ 1 0 3 0]
[ 2 0 0 0]
[ 4 0 9 0]
[ 0 0 0 18]
[ 0 0 0 0]
[ 0 0 12 0]]
Is there a faster way to do this? My real data has roughly 100,000 rows and 13,000 columns.
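For reference, here is a minimal sketch of a variant of the approach above. It derives the sizes from the data instead of hard-coding 6 and 3, and it passes only the non-zero entries to coo_matrix (the version above stores the zeros explicitly). The helper name lists_to_coo and its n_cols parameter are hypothetical, and the sketch assumes every list in 'val' has the same length. It still builds one dense intermediate array, just as np.concatenate does, so it is not necessarily the faster method being asked for.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def lists_to_coo(df, n_cols=None):
    """Hypothetical helper: column df['col'][i] of the result holds the list df['val'][i].

    Assumes every list in df['val'] has the same length.
    """
    # Stack the lists into a 2-D array, one row per DataFrame row.
    values = np.array(df['val'].tolist())   # shape (n_lists, n_rows)
    n_lists, n_rows = values.shape
    cols = df['col'].values
    if n_cols is None:
        n_cols = int(cols.max()) + 1         # smallest width that fits every column index

    # Positions of the non-zero entries only, so zeros are never stored.
    list_idx, row_idx = np.nonzero(values)
    return sparse.coo_matrix(
        (values[list_idx, row_idx], (row_idx, cols[list_idx])),
        shape=(n_rows, n_cols),
    )


df = pd.DataFrame({
    'val': [[1, 2, 4, 0, 0, 0], [3, 0, 9, 0, 0, 12], [0, 0, 0, 18, 0, 0]],
    'col': [0, 2, 3],
})
result = lists_to_coo(df)
print(result.toarray())
```

On the example data this should give the same 6 x 4 matrix as above, with 7 stored entries instead of 18.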