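Using pandas, is there a quick way to randomly sample N hours per day from a multi-year, multi-indexed, hourly dataset? My goal is to get N random hours per day for each X, Y pair.
If my data looks like this: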
In [21]: df
Out[21]:
                        Stuff
Date                X Y
2004-01-01 02:00:00 0 1 1.047065
2004-01-01 03:00:00 0 1 -1.048725
2004-01-01 04:00:00 0 1 -0.245098
2004-01-01 05:00:00 0 1 0.452306
2004-01-01 02:00:00 2 3 0.100935
2004-01-01 03:00:00 2 3 -1.183009
2004-01-01 04:00:00 2 3 0.164260
2004-01-01 05:00:00 2 3 -1.013031
2004-01-01 02:00:00 4 2 -0.300900
2004-01-01 03:00:00 4 2 0.698377
2004-01-01 04:00:00 4 2 0.335517
2004-01-01 05:00:00 4 2 -0.421466
2004-01-01 02:00:00 7 9 -0.904358
2004-01-01 03:00:00 7 9 1.496770
2004-01-01 04:00:00 7 9 -0.966784
2004-01-01 05:00:00 7 9 0.101442
2004-01-02 02:00:00 0 1 0.771495
2004-01-02 03:00:00 0 1 -1.559194
2004-01-02 04:00:00 0 1 0.497352
2004-01-02 05:00:00 0 1 0.377913
2004-01-02 02:00:00 2 3 0.637454
2004-01-02 03:00:00 2 3 -0.381010
2004-01-02 04:00:00 2 3 1.973359
2004-01-02 05:00:00 2 3 0.390250
2004-01-02 02:00:00 4 2 0.948655
2004-01-02 03:00:00 4 2 0.234342
2004-01-02 04:00:00 4 2 0.766474
2004-01-02 05:00:00 4 2 -0.529767
2004-01-02 02:00:00 7 9 0.682759
2004-01-02 03:00:00 7 9 2.202768
2004-01-02 04:00:00 7 9 2.190237
2004-01-02 05:00:00 7 9 -1.641499
I would like to get a result like this (for N = 2):
                        Stuff
Date                X Y
2004-01-01 02:00:00 0 1 1.047065
2004-01-01 05:00:00 0 1 0.452306
2004-01-01 04:00:00 2 3 0.164260
2004-01-01 05:00:00 2 3 -1.013031
2004-01-01 02:00:00 4 2 -0.300900
2004-01-01 03:00:00 4 2 0.698377
2004-01-01 02:00:00 7 9 -0.904358
2004-01-01 05:00:00 7 9 0.101442
2004-01-02 03:00:00 0 1 -1.559194
2004-01-02 04:00:00 0 1 0.497352
2004-01-02 04:00:00 2 3 1.973359
2004-01-02 05:00:00 2 3 0.390250
2004-01-02 02:00:00 4 2 0.948655
2004-01-02 05:00:00 4 2 -0.529767
2004-01-02 04:00:00 7 9 2.190237
2004-01-02 05:00:00 7 9 -1.641499
Answer 0 (score: 2)
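Update: You have changed the question to group by X and Y as well as by time. To use TimeGrouper (described below, in my answer to the original question) together with other grouping criteria (such as ['X', 'Y']), see this answer.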
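Since the linked answer is not reproduced here, the following is only a minimal sketch of one way to combine a per-day grouping with the X and Y levels. It is not necessarily the approach of the linked answer: it shuffles all rows once and then keeps the first N rows of each (day, X, Y) group via cumcount. The helper name sample_hours_per_day and its seed parameter are hypothetical, and the frame is rebuilt with random values in the shape shown in the question.

```python
import numpy as np
import pandas as pd

N = 2  # hours to keep per day and per (X, Y) pair

# Rebuild a frame shaped like the one in the question (values are random).
hours = pd.to_datetime(
    [f"2004-01-0{d} 0{h}:00:00" for d in (1, 2) for h in (2, 3, 4, 5)]
)
pairs = [(0, 1), (2, 3), (4, 2), (7, 9)]
index = pd.MultiIndex.from_tuples(
    [(t, x, y) for t in hours for (x, y) in pairs],
    names=["Date", "X", "Y"],
)
rng = np.random.default_rng(0)
df = pd.DataFrame({"Stuff": rng.standard_normal(len(index))}, index=index)


def sample_hours_per_day(frame, n, seed=None):
    """Hypothetical helper: keep n random rows per (day, X, Y) group."""
    # Shuffle every row once, so the first n rows of each group are random.
    shuffled = frame.sample(frac=1, random_state=seed)
    idx = shuffled.index
    keys = [
        idx.get_level_values("Date").normalize(),  # midnight of each day
        idx.get_level_values("X"),
        idx.get_level_values("Y"),
    ]
    # Position of each row within its group, in shuffled order.
    rank = shuffled.groupby(keys).cumcount()
    # Groups with fewer than n rows are kept whole.
    return shuffled[rank < n].sort_index()


picked = sample_hours_per_day(df, N, seed=0)
print(picked)
```

Shuffling once and ranking within groups avoids calling a Python function per group, which may matter on a multi-year hourly dataset.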
Group hourly, and use transform together with this answer, as follows:
df.groupby(pd.TimeGrouper('H')).transform(lambda x: x[random.sample(x.index, N)])
Example: I generate a dataset with several samples per hour, and I randomly select two per hour.
In [62]: df = DataFrame(np.random.randn(6), pd.date_range(freq='20T', start=pd.datetime.now(), periods=6))
In [63]: df
Out[63]:
0
2013-10-08 14:18:49 0.709713
2013-10-08 14:38:49 1.413776
2013-10-08 14:58:49 -0.725483
2013-10-08 15:18:49 1.251557
2013-10-08 15:38:49 -1.049705
2013-10-08 15:58:49 1.100699
In [65]: df.groupby(pd.TimeGrouper('H')).transform(lambda x: x[random.sample(x.index, 2)])
Out[65]:
0
2013-10-08 14:18:49 0.709713
2013-10-08 14:58:49 -0.725483
2013-10-08 15:38:49 -1.049705
2013-10-08 15:58:49 1.100699
I used the built-in module random. NumPy version 1.7 adds numpy.random.choice, which offers the same functionality and is, I assume, somewhat faster.
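For readers on current library versions, here is a hedged sketch of the same hourly example using NumPy for the random draw. It assumes a recent pandas, where pd.TimeGrouper is no longer available, and NumPy's Generator API (np.random.default_rng), neither of which appears in the answer above. The names df_hourly, hour_start and keep are illustrative only, and every hour is assumed to contain at least N rows.

```python
import numpy as np
import pandas as pd

N = 2  # samples to keep per clock hour
rng = np.random.default_rng(0)

# Six samples, 20 minutes apart, spanning two clock hours (as in the example).
times = pd.to_datetime([
    "2013-10-08 14:18:49", "2013-10-08 14:38:49", "2013-10-08 14:58:49",
    "2013-10-08 15:18:49", "2013-10-08 15:38:49", "2013-10-08 15:58:49",
])
df_hourly = pd.DataFrame({0: rng.standard_normal(len(times))}, index=times)

# Label each row with the start of its clock hour ("60min" is one hour).
hour_start = df_hourly.index.floor("60min")

# For each hour, draw N distinct row positions without replacement.
keep = np.concatenate([
    rng.choice(np.flatnonzero(hour_start == h), size=N, replace=False)
    for h in hour_start.unique()
])

# Select the drawn rows, restoring chronological order.
sampled = df_hourly.iloc[np.sort(keep)]
print(sampled)
```

Drawing positions rather than index labels keeps the sketch independent of whether the timestamps are unique.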