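This is my first attempt at pandas. I think I have a reasonable use case, but I keep stumbling. I want to load a tab-delimited file into a pandas DataFrame, group it by Symbol, and plot it with the TimeStamp column as the x-axis. Here is a subset of the data: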
Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
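Note two things about the TimeStamp column: it contains duplicate values, and the intervals between timestamps are irregular.
I thought I could do something like this...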
from pandas import *
import pylab as plt
df = read_csv('data.txt',index_col=5)
df.sort(ascending=False)
df.plot()
plt.show()
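But the read_csv call raises an exception: "Tried columns 1-X as index but found duplicates". Is there an option that lets me specify an index column with duplicate values?
I would also like to align my irregular timestamp intervals to one-second resolution. I still want to plot multiple events within a given second, but maybe I could introduce a unique index and then align my prices to it?
Answer 0 (score: 5)
I just created a few issues covering some features and conveniences that I think would be nice to have: GH-856, GH-857, GH-858
We are currently revamping the time series functionality, and aligning to one-second resolution is possible now (though not with duplicates, so some functions would need to be written for that). I also want to support duplicate timestamps in a better way. However, this is really panel (3D) data, so one way you could reshape it is the following: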
In [29]: df.pivot('Symbol', 'TimeStamp').stack()
Out[29]:
                   M1    M2   Price   Volume
Symbol TimeStamp
FUEL   9:58:40 AM   6  1.05  3.8544   100116
       9:58:47 AM   7  1.07  3.8599   102116
       9:59:09 AM   8  1.11  3.9099   105265
       9:59:11 AM   9  1.15  3.9490   109674
GBR    9:57:52 AM   2  0.34  3.7500    47521
       9:58:20 AM   3  0.45  3.8000    63211
       9:58:24 AM   4  0.46  3.8300    64251
MPET   9:57:52 AM   3  0.26  1.4200    44600
ORBC   9:59:02 AM   2  0.22  3.4000    10509
SUNH   9:59:09 AM   6  0.09  4.3700    24394
TBET   9:59:05 AM   2  8.03  2.1800  1121629
       9:59:14 AM   3  8.05  2.1900  1124179
XRA    9:58:08 AM   3  0.12  3.6167    42310
Note that this created a MultiIndex. Another way I could have gotten this:
In [32]: df.set_index(['Symbol', 'TimeStamp'])
Out[32]:
                    Price  M1    M2   Volume
Symbol TimeStamp
TBET   9:59:14 AM  2.1900   3  8.05  1124179
FUEL   9:59:11 AM  3.9490   9  1.15   109674
SUNH   9:59:09 AM  4.3700   6  0.09    24394
FUEL   9:59:09 AM  3.9099   8  1.11   105265
TBET   9:59:05 AM  2.1800   2  8.03  1121629
ORBC   9:59:02 AM  3.4000   2  0.22    10509
FUEL   9:58:47 AM  3.8599   7  1.07   102116
       9:58:40 AM  3.8544   6  1.05   100116
GBR    9:58:24 AM  3.8300   4  0.46    64251
       9:58:20 AM  3.8000   3  0.45    63211
XRA    9:58:08 AM  3.6167   3  0.12    42310
GBR    9:57:52 AM  3.7500   2  0.34    47521
MPET   9:57:52 AM  1.4200   3  0.26    44600
In [33]: df.set_index(['Symbol', 'TimeStamp']).sortlevel(0)
Out[33]:
                    Price  M1    M2   Volume
Symbol TimeStamp
FUEL   9:58:40 AM  3.8544   6  1.05   100116
       9:58:47 AM  3.8599   7  1.07   102116
       9:59:09 AM  3.9099   8  1.11   105265
       9:59:11 AM  3.9490   9  1.15   109674
GBR    9:57:52 AM  3.7500   2  0.34    47521
       9:58:20 AM  3.8000   3  0.45    63211
       9:58:24 AM  3.8300   4  0.46    64251
MPET   9:57:52 AM  1.4200   3  0.26    44600
ORBC   9:59:02 AM  3.4000   2  0.22    10509
SUNH   9:59:09 AM  4.3700   6  0.09    24394
TBET   9:59:05 AM  2.1800   2  8.03  1121629
       9:59:14 AM  2.1900   3  8.05  1124179
XRA    9:58:08 AM  3.6167   3  0.12    42310
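For readers on a more recent pandas, the session above may not run as typed, since sortlevel is no longer available there. The following is a minimal sketch of the same set_index-then-sort step, assuming the sample rows are held in an in-memory string (the DATA name is only for illustration) and using sort_index(level=0) in place of sortlevel(0):

```python
import io

import pandas as pd

# The sample rows from the question, kept in memory for the example.
DATA = """Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
"""

# Read without index_col so duplicate timestamps are not an issue at load time.
df = pd.read_csv(io.StringIO(DATA))

# Build the (Symbol, TimeStamp) MultiIndex and sort it, first by Symbol,
# then by the TimeStamp strings within each symbol.
indexed = df.set_index(["Symbol", "TimeStamp"]).sort_index(level=0)
print(indexed)
```

The timestamps are still plain strings here, so the ordering within each symbol is lexicographic. That happens to match chronological order for this sample only because every time falls in the same hour.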
You can get this data into a true panel format like so:
In [35]: df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()
Out[35]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 11 (major) x 7 (minor)
Items: Price to Volume
Major axis: 9:57:52 AM to 9:59:14 AM
Minor axis: FUEL to XRA
In [36]: panel = df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()
In [37]: panel['Price']
Out[37]:
Symbol        FUEL   GBR  MPET  ORBC  SUNH  TBET     XRA
TimeStamp
9:57:52 AM     NaN  3.75  1.42   NaN   NaN   NaN     NaN
9:58:08 AM     NaN   NaN   NaN   NaN   NaN   NaN  3.6167
9:58:20 AM     NaN  3.80   NaN   NaN   NaN   NaN     NaN
9:58:24 AM     NaN  3.83   NaN   NaN   NaN   NaN     NaN
9:58:40 AM  3.8544   NaN   NaN   NaN   NaN   NaN     NaN
9:58:47 AM  3.8599   NaN   NaN   NaN   NaN   NaN     NaN
9:59:02 AM     NaN   NaN   NaN   3.4   NaN   NaN     NaN
9:59:05 AM     NaN   NaN   NaN   NaN   NaN  2.18     NaN
9:59:09 AM  3.9099   NaN   NaN   NaN  4.37   NaN     NaN
9:59:11 AM  3.9490   NaN   NaN   NaN   NaN   NaN     NaN
9:59:14 AM     NaN   NaN   NaN   NaN   NaN  2.19     NaN
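More recent pandas releases no longer ship Panel, so the to_panel call above may not be available. As a rough sketch of one alternative (not the method used in this answer), the same TimeStamp-by-Symbol price table can be built directly with DataFrame.pivot. The DATA name is an assumption for illustration, and this only works while each (TimeStamp, Symbol) pair occurs at most once:

```python
import io

import pandas as pd

# The sample rows from the question, kept in memory for the example.
DATA = """Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
"""

df = pd.read_csv(io.StringIO(DATA))

# One row per timestamp, one column per symbol, prices as the cell values.
# Cells with no observation for that symbol at that time come out as NaN.
wide = df.pivot(index="TimeStamp", columns="Symbol", values="Price")
print(wide)
```

Swapping values="Price" for "Volume", "M1" or "M2" gives the other slices that the panel would have held as separate items.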
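From there you can generate some plots from that data.
Note that the timestamps are still strings. I suppose they could be converted to Python datetime.time objects, which might make things a bit easier to work with. I don't have many plans to provide much support for raw times as opposed to timestamps (date + time), but if enough people need it I suppose I can be convinced :)
If you have multiple observations within one second for a single symbol, some of the above methods will not work. I would like to build better support for that into upcoming releases of pandas, so knowing your use cases is helpful to me. Consider joining the mailing list (pystatsmodels).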
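To make the plotting step concrete, here is one possible sketch that draws one price line per symbol with matplotlib. It assumes matplotlib is installed, parses the time strings first so the x-axis is chronological, and uses a shortened copy of the sample rows. The DATA name and the non-interactive Agg backend are choices made for this example only:

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend for the example; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# The sample rows from the question (only the columns needed for the plot).
DATA = """Symbol,Price,TimeStamp
TBET,2.19,9:59:14 AM
FUEL,3.949,9:59:11 AM
SUNH,4.37,9:59:09 AM
FUEL,3.9099,9:59:09 AM
TBET,2.18,9:59:05 AM
ORBC,3.4,9:59:02 AM
FUEL,3.8599,9:58:47 AM
FUEL,3.8544,9:58:40 AM
GBR,3.83,9:58:24 AM
GBR,3.8,9:58:20 AM
XRA,3.6167,9:58:08 AM
GBR,3.75,9:57:52 AM
MPET,1.42,9:57:52 AM
"""

df = pd.read_csv(io.StringIO(DATA))

# Parse the clock times; with no date in the data, the date part defaults to 1900-01-01.
df["TimeStamp"] = pd.to_datetime(df["TimeStamp"], format="%I:%M:%S %p")

fig, ax = plt.subplots()
# Sort by time first so each symbol's line is drawn left to right.
for symbol, group in df.sort_values("TimeStamp").groupby("Symbol"):
    ax.plot(group["TimeStamp"], group["Price"], marker="o", label=symbol)

ax.set_xlabel("TimeStamp")
ax.set_ylabel("Price")
ax.legend()
# Call plt.show() or fig.savefig(...) here to view or save the figure.
```

Plotting per group rather than from the wide table avoids the NaN gaps between observations, and the markers keep symbols with a single observation visible.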
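As a small sketch of the string-to-time conversion mentioned above, pandas.to_datetime with an explicit format can parse the strings, and the .dt.time accessor then yields datetime.time objects. The TRADE_DATE value is purely hypothetical, since the data carries no date, and only a few of the sample rows are used:

```python
import io

import pandas as pd

# A few of the sample rows from the question.
DATA = """Symbol,Price,TimeStamp
TBET,2.19,9:59:14 AM
FUEL,3.949,9:59:11 AM
SUNH,4.37,9:59:09 AM
GBR,3.75,9:57:52 AM
"""

df = pd.read_csv(io.StringIO(DATA))

# Parse 12-hour clock strings such as "9:59:14 AM".
# The date part defaults to 1900-01-01 because none is given.
parsed = pd.to_datetime(df["TimeStamp"], format="%I:%M:%S %p")

# Raw time-of-day values as datetime.time objects.
df["Time"] = parsed.dt.time

# Hypothetical trading date (an assumption, not in the source data),
# combined with the time of day to get full timestamps.
TRADE_DATE = pd.Timestamp("2012-03-01")
df["When"] = TRADE_DATE + (parsed - parsed.dt.normalize())

print(df[["Symbol", "Time", "When"]])
```

Full timestamps are usually the more convenient of the two, since they sort, subtract and resample like any other datetime column.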
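For the duplicate case, two workarounds seem plausible; this is only a sketch, not built-in support of the kind described above. One collapses each (TimeStamp, Symbol) pair to a single value, the other keeps every row and adds a sequence number to make the index unique. The rows below are invented for illustration and are not from the question, and the Seq column name is hypothetical:

```python
import pandas as pd

# Hypothetical rows in which FUEL has two observations in the same second.
df = pd.DataFrame(
    {
        "Symbol": ["FUEL", "FUEL", "FUEL", "GBR"],
        "Price": [3.90, 3.91, 3.95, 3.83],
        "TimeStamp": ["9:59:09 AM", "9:59:09 AM", "9:59:11 AM", "9:58:24 AM"],
    }
)

# Option 1: keep a single value per (TimeStamp, Symbol), here the last one
# seen, then spread the symbols across columns.
last = df.groupby(["TimeStamp", "Symbol"])["Price"].last().unstack("Symbol")

# Option 2: keep every observation and number the repeats within each
# (Symbol, TimeStamp) pair so the resulting index is unique.
df["Seq"] = df.groupby(["Symbol", "TimeStamp"]).cumcount()
unique = df.set_index(["Symbol", "TimeStamp", "Seq"]).sort_index()

print(last)
print(unique)
```

Option 1 loses the intra-second detail but gives a table that aligns cleanly on one-second timestamps. Option 2 preserves every event, at the cost of an extra index level to carry around.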