I have a 33620x160 pandas DataFrame in which one column contains lists of numbers. Each list entry in the DataFrame contains 30 elements.
df['dlrs_col']
0 [0.048142470608688, 0.047021138711858, 0.04573...
1 [0.048142470608688, 0.047021138711858, 0.04573...
2 [0.048142470608688, 0.047021138711858, 0.04573...
3 [0.048142470608688, 0.047021138711858, 0.04573...
4 [0.048142470608688, 0.047021138711858, 0.04573...
5 [0.048142470608688, 0.047021138711858, 0.04573...
6 [0.048142470608688, 0.047021138711858, 0.04573...
7 [0.048142470608688, 0.047021138711858, 0.04573...
8 [0.048142470608688, 0.047021138711858, 0.04573...
9 [0.048142470608688, 0.047021138711858, 0.04573...
10 [0.048142470608688, 0.047021138711858, 0.04573...
I am creating a 33620x30 array whose entries are the unlisted values from that single DataFrame column. I currently do it like this:
np.array(df['dlrs_col'].tolist(), dtype = 'float64')
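This works fine, but it takes a significant amount of time, especially since I run similar computations for 6 additional columns of lists. Any ideas on how to speed this up?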
Answer 0 (score: 1)
You can do it like this:
In [140]: df
Out[140]:
dlrs_col
0 [0.048142470608688, 0.047021138711858, 0.04573]
1 [0.048142470608688, 0.047021138711858, 0.04573]
2 [0.048142470608688, 0.047021138711858, 0.04573]
3 [0.048142470608688, 0.047021138711858, 0.04573]
4 [0.048142470608688, 0.047021138711858, 0.04573]
5 [0.048142470608688, 0.047021138711858, 0.04573]
6 [0.048142470608688, 0.047021138711858, 0.04573]
7 [0.048142470608688, 0.047021138711858, 0.04573]
8 [0.048142470608688, 0.047021138711858, 0.04573]
9 [0.048142470608688, 0.047021138711858, 0.04573]
In [141]: df.dlrs_col.apply(pd.Series)
Out[141]:
0 1 2
0 0.048142 0.047021 0.04573
1 0.048142 0.047021 0.04573
2 0.048142 0.047021 0.04573
3 0.048142 0.047021 0.04573
4 0.048142 0.047021 0.04573
5 0.048142 0.047021 0.04573
6 0.048142 0.047021 0.04573
7 0.048142 0.047021 0.04573
8 0.048142 0.047021 0.04573
9 0.048142 0.047021 0.04573
In [142]: df.dlrs_col.apply(pd.Series).values
Out[142]:
array([[ 0.04814247, 0.04702114, 0.04573 ],
[ 0.04814247, 0.04702114, 0.04573 ],
[ 0.04814247, 0.04702114, 0.04573 ],
[ 0.04814247, 0.04702114, 0.04573 ],
[ 0.04814247, 0.04702114, 0.04573 ],
[ 0.04814247, 0.04702114, 0.04573 ],
[ 0.04814247, 0.04702114, 0.04573 ],
[ 0.04814247, 0.04702114, 0.04573 ],
[ 0.04814247, 0.04702114, 0.04573 ],
[ 0.04814247, 0.04702114, 0.04573 ]])
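One thing worth checking before adopting this on the full frame: apply(pd.Series) constructs a Series object for every row, so it is likely to be slower than the list-based conversion from the question rather than faster. The sketch below is only an illustration on a hypothetical 1000-row synthetic frame (the name df_small and the random data are assumptions, not the asker's data); it confirms that both routes produce the same array.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 1000 rows, each holding a list of 30 floats.
rng = np.random.default_rng(0)
data = rng.random((1000, 30))
df_small = pd.DataFrame({'dlrs_col': data.tolist()})

# Route from this answer: expand each list into its own Series, then take the values.
via_apply = df_small['dlrs_col'].apply(pd.Series).values

# Route from the question: convert the column to a list of lists, then to an array.
via_list = np.array(df_small['dlrs_col'].tolist(), dtype='float64')

print(via_apply.shape)
print(np.array_equal(via_apply, via_list))
```

If the two arrays match, the choice between the routes comes down to speed, which is best measured on a frame of realistic size.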
Answer 1 (score: 0)
You can first convert to a numpy array with .values:
import numpy as np
import pandas as pd
df = pd.DataFrame({'dlrs_col':[
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573]]})
print(df)
dlrs_col
0 [0.048142470608688, 0.047021138711858, 0.04573]
1 [0.048142470608688, 0.047021138711858, 0.04573]
2 [0.048142470608688, 0.047021138711858, 0.04573]
3 [0.048142470608688, 0.047021138711858, 0.04573]
4 [0.048142470608688, 0.047021138711858, 0.04573]
5 [0.048142470608688, 0.047021138711858, 0.04573]
6 [0.048142470608688, 0.047021138711858, 0.04573]
7 [0.048142470608688, 0.047021138711858, 0.04573]
print(np.array(df['dlrs_col'].values.tolist(), dtype='float64'))
[[ 0.04814247 0.04702114 0.04573 ]
[ 0.04814247 0.04702114 0.04573 ]
[ 0.04814247 0.04702114 0.04573 ]
[ 0.04814247 0.04702114 0.04573 ]
[ 0.04814247 0.04702114 0.04573 ]
[ 0.04814247 0.04702114 0.04573 ]
[ 0.04814247 0.04702114 0.04573 ]
[ 0.04814247 0.04702114 0.04573 ]]
Timings:
In [56]: %timeit (np.array(df['dlrs_col'].values.tolist(), dtype = 'float64'))
The slowest run took 9.76 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.1 µs per loop
In [57]: %timeit (np.array(df['dlrs_col'].tolist(), dtype = 'float64'))
The slowest run took 9.33 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 28.4 µs per loop
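These timings appear to come from the 8-row example frame, so they may not carry over to a frame of the size in the question. A rough way to check is to repeat the comparison on synthetic data of the same shape. The sketch below assumes random floats as a stand-in for the real column; df_big and the two helper function names are hypothetical, and the numbers it prints will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the question's column: 33620 rows of 30-element lists.
rng = np.random.default_rng(0)
df_big = pd.DataFrame({'dlrs_col': rng.random((33620, 30)).tolist()})


def via_series_tolist():
    # Original approach from the question.
    return np.array(df_big['dlrs_col'].tolist(), dtype='float64')


def via_values_tolist():
    # Approach from this answer: take the underlying object array first.
    return np.array(df_big['dlrs_col'].values.tolist(), dtype='float64')


for fn in (via_series_tolist, via_values_tolist):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

Whichever variant comes out ahead, both return the same 33620x30 float64 array, so they can be swapped freely.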