I'm working in pandas with about 300 MB of financial data corresponding to limit orders in an auction. It is multi-dimensional data that looks like this:
bid ask
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity
2014-05-13 08:47:16.180000 102.298 1000000 102.297 1500000 102.296 6500000 102.295 8000000 102.294 3000000 102.293 24300000 102.292 6000000 102.291 1000000 102.290 1000000 102.289 2500000 102.288 11000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 6100000 102.304 14700000 102.305 3500000 102.307 9800000 102.308 15500000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.003000 102.298 1000000 102.297 2500000 102.296 6500000 102.295 7000000 102.294 3000000 102.293 24300000 102.292 6000000 102.291 1000000 102.290 1000000 102.289 2500000 102.288 11000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 5100000 102.304 14700000 102.305 4500000 102.307 9800000 102.308 15500000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.005000 102.298 3000000 102.297 3500000 102.296 6000000 102.295 9300000 102.294 4000000 102.293 17500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 5100000 102.304 14700000 102.305 4500000 102.307 9000000 102.308 16300000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.006000 102.299 1000000 102.298 3000000 102.297 6500000 102.296 5000000 102.295 5300000 102.294 4000000 102.293 15500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN 102.302 2000000 102.303 5100000 102.304 11700000 102.305 7500000 102.307 9000000 102.308 11300000 102.309 5000000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.007000 102.299 1000000 102.298 3000000 102.297 8500000 102.296 4000000 102.295 4300000 102.294 5000000 102.293 14500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN 102.302 2000000 102.303 4100000 102.304 13700000 102.305 7500000 102.307 8000000 102.308 12300000 102.309 5000000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN
(Notice how the first level changes as you go down. Sorry for the long format of the table...)
I need to do some pivoting operations to process the data. For example, instead of 0, 1, 2, 3... (the order's relative position in the queue) as the index, I want 102.297, 102.296, ... i.e. the order's price. Here is an example of such an operation:
x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
which gives something like:
quantity
side price
2014-05-13 08:47:17.003000 ask 102.300 0
102.301 0
102.302 0
102.303 -1000000
102.304 0
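For readability, here is the same chain broken down step by step, with a comment on what each call does (just a sketch; x is the wide frame shown above):

y = x.stack([0, 0])                    # move 'side' and 'level' from the columns into the row index
y = y.reset_index(drop=True, level=2)  # drop the 'level' (0..19 queue position), no longer needed
y = y.set_index("price", append=True)  # index rows by (timestamp, side, price)
y = y.unstack([1, 2])                  # pivot 'side' and 'price' back out to the columns
y = y.fillna(0).diff()                 # change in quantity between consecutive timestamps
y = y.stack([1, 1])                    # bring 'side' and 'price' back into the row index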
This can be achieved with a combination of stack/unstack/reset_index, but it seems very inefficient. I haven't looked at the code, but my guess is that each stack/unstack makes a copy of the table, which causes my 8 GB machine to run out of memory and start hitting the page file. I don't think I can use pivot here, because the columns I need are part of a MultiIndex.
Any suggestions on how to speed this up?
Here is a sample of the input CSV file, as requested in the comments:
side,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask
level,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19
value,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity
2014-05-13 08:47:16.18,102.298,1000000.0,102.297,1500000.0,102.296,6500000.0,102.295,8000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,6100000.0,102.304,1.47E7,102.305,3500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.003,102.298,1000000.0,102.297,2500000.0,102.296,6500000.0,102.295,7000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.005,102.298,3000000.0,102.297,3500000.0,102.296,6000000.0,102.295,9300000.0,102.294,4000000.0,102.293,1.75E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9000000.0,102.308,1.63E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.006,102.299,1000000.0,102.298,3000000.0,102.297,6500000.0,102.296,5000000.0,102.295,5300000.0,102.294,4000000.0,102.293,1.55E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.17E7,102.305,7500000.0,102.307,9000000.0,102.308,1.13E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.007,102.299,1000000.0,102.298,3000000.0,102.297,8500000.0,102.296,4000000.0,102.295,4300000.0,102.294,5000000.0,102.293,1.45E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,4100000.0,102.304,1.37E7,102.305,7500000.0,102.307,8000000.0,102.308,1.23E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
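(For reference, one way such a file could be read back into a DataFrame with the three header rows as a column MultiIndex; this is only a sketch and the file name quotes.csv is a placeholder:)

import pandas as pd

# read the three header rows (side, level, value) as a column MultiIndex,
# use the first column (timestamps) as the row index, and treat "N/A" as NaN
df = pd.read_csv("quotes.csv", header=[0, 1, 2], index_col=0,
                 parse_dates=True, na_values=["N/A"])
# name the column levels explicitly in case read_csv does not pick them up
df.columns.names = ["side", "level", "value"]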
Answer (score: 1)
unstack actually creates the full enumeration of index × columns, so when you have many rows and columns it can take a huge amount of memory.
Here is a solution that is slower, but should have a lower peak memory usage (I think). It also produces a slightly smaller result, because some zero entries that appear in the original are not present here (but you could always reindex and fill to fix that).
Define this function, which could probably be optimized further for this case (the grouping by level is already done):
In [79]: def f(x):
             try:
                 # the same pivot as in the question, applied to one 'level' group of columns
                 y = x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
                 # keep only the entries where the quantity actually changed
                 return y[y!=0].dropna()
             except:
                 return None
Group on 'level' in the columns and apply f; don't use apply directly, just concatenate the results as rows (this is the 'unstacking' part).
However, this produces duplicates (at the price level), so they need to be aggregated.
In [76]: concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
Out[76]:
value quantity
side price
2014-05-13 08:47:17.003 ask 102.303 -1000000
102.305 1000000
bid 102.295 -1000000
102.297 1000000
2014-05-13 08:47:17.005 ask 102.307 -800000
102.308 800000
bid 102.288 -5000000
102.291 3000000
102.292 -4000000
102.293 -6800000
102.294 1000000
102.295 2300000
102.296 -500000
102.297 1000000
102.298 2000000
2014-05-13 08:47:17.006 ask 102.304 -3000000
102.305 3000000
102.308 -5000000
102.309 5000000
102.310 0
102.312 0
102.313 0
102.315 0
102.316 0
102.320 0
102.321 0
102.324 0
102.325 0
bid 102.265 -9500000
102.275 0
102.276 0
102.280 0
102.284 0
102.286 0
102.287 0
102.288 0
102.289 0
102.290 0
102.291 0
102.292 0
102.293 -2000000
102.294 0
102.295 -4000000
102.296 -1000000
102.297 3000000
102.298 0
102.299 1000000
2014-05-13 08:47:17.007 ask 102.303 -1000000
102.304 2000000
102.307 -1000000
102.308 1000000
bid 102.293 -1000000
102.294 1000000
102.295 -1000000
102.296 -1000000
102.297 2000000
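Spelled out, that one-liner does the following (a sketch using the same df and f as above; sortlevel is sort_index in newer pandas):

from pandas import concat

# apply f to one depth-'level' group of columns at a time
pieces = [f(grp) for g, grp in df.groupby(level='level', axis=1)]
# glue the per-level results together as rows, then aggregate the duplicate
# (timestamp, side, price) entries and sort the MultiIndex
result = concat(pieces).groupby(level=[0, 1, 2]).sum().sortlevel()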
Timings (I think optimizing f would make this quite a bit faster):
In [77]: %timeit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
1 loops, best of 3: 319 ms per loop
In [78]: %memit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
maximum of 1: 67.515625 MB per loop
The original method:
In [7]: %timeit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
10 loops, best of 3: 56.4 ms per loop
In [8]: %memit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
maximum of 1: 61.187500 MB per loop