Question

I have a dataset that looks like the following. The real thing is much larger (> 300K rows), but this should do.

    datetime                type    price   bid?    quantity    order book
0   2017-03-01 09:30:00.656 quote   6.15    T       800000.0    2493
1   2017-03-01 09:30:00.656 quote   6.20    T       800000.0    2493
2   2017-03-03 09:30:00.657 quote   6.25    F       800000.0    2493
3   2017-03-04 09:30:00.669 quote   6.15    T       2600000.0   2493
4   2017-03-10 09:30:00.669 quote   6.30    F       800000.0    2493
5   2017-03-28 09:30:00.669 quote   6.35    F       800000.0    2493
6   2017-03-28 09:30:00.682 quote   6.25    F       1200000.0   2493
7   2017-03-30 09:30:00.684 quote   6.20    T       2300000.0   2493

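What I want to achieve here is to loop a function over all the dates in the dataset. More specifically, I am trying to run my analysis for each day. What I have tried so far: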

for date in y['datetime'].dt.date():
    print(date)

and

y.groupby(columns=y['datetime'].dt.date())

But both approaches result in

TypeError: 'Series' object is not callable

Any help would be greatly appreciated. Thanks!

Answer 1

I think you need to group by date and apply a function f, which is then called once for each day:

def f(x):
    #sample function
    print (x)
    x['price'] = x['price'] * 2 + x['quantity']
    ... 
    return x

df = y.groupby(y['datetime'].dt.date).apply(f)
print (df)
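The TypeError in the question comes from the parentheses: `.dt.date` is a property, not a method. As a minimal sketch of the per-day loop, assuming a small stand-in frame built from the first rows of the question's sample (the name `rows_per_day` is only illustrative), iterating over the groupby yields one sub-frame per date:

```python
import pandas as pd

# Small stand-in for the asker's data (values copied from the sample in the question).
y = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-03-01 09:30:00.656",
        "2017-03-01 09:30:00.656",
        "2017-03-03 09:30:00.657",
        "2017-03-04 09:30:00.669",
    ]),
    "price": [6.15, 6.20, 6.25, 6.15],
    "quantity": [800000.0, 800000.0, 800000.0, 2600000.0],
})

# .dt.date is a property, so no parentheses follow it.
days = y["datetime"].dt.date

# Iterating over the groupby yields one (date, sub-frame) pair per day.
rows_per_day = {}
for day, group in y.groupby(days):
    rows_per_day[day] = len(group)

print(rows_per_day)
```

Whatever analysis is needed per day can replace the `len(group)` line; `group` is an ordinary DataFrame holding only that day's rows.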

Or use resample. It creates a continuous DatetimeIndex, so if some dates are missing from the data, NaNs are added for them:

y.resample('D', on='datetime').apply(f)
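To make the missing-date behaviour concrete, here is a small sketch that uses a built-in aggregation (`mean`) in place of the custom `f`. It assumes a stand-in frame with a gap on 2017-03-02, and the name `daily` is only illustrative:

```python
import pandas as pd

# Stand-in data with no rows on 2017-03-02 (values copied from the question's sample).
y = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-03-01 09:30:00.656",
        "2017-03-01 09:30:00.656",
        "2017-03-03 09:30:00.657",
        "2017-03-04 09:30:00.669",
    ]),
    "price": [6.15, 6.20, 6.25, 6.15],
})

# Daily bins cover every calendar day between the first and last timestamp.
daily = y.resample("D", on="datetime")["price"].mean()

# The bin for 2017-03-02 exists but holds NaN, because no rows fall into it.
print(daily)
```

If the empty days are not wanted, `daily.dropna()` removes them; grouping by `.dt.date` never creates them in the first place.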

Answer 2

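Well, don't loop.

If you have 300k rows, looping will be very slow and suboptimal.

Here is another solution:

A common practice with time series data is to use the timestamp as the row index.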

To do that, you can: y = y.set_index('datetime')

After that, if you want the dates, you can simply: dates = y.index.date
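Putting those two steps together, a possible sketch of a loop-free per-day calculation follows. It assumes a stand-in frame, and the per-day sum of quantity is only an example of an analysis; the names `indexed`, `dates` and `per_day` are illustrative:

```python
import pandas as pd

# Stand-in data (values copied from the question's sample).
y = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-03-01 09:30:00.656",
        "2017-03-01 09:30:00.656",
        "2017-03-03 09:30:00.657",
        "2017-03-04 09:30:00.669",
    ]),
    "quantity": [800000.0, 800000.0, 800000.0, 2600000.0],
})

# Use the timestamp as the row index.
indexed = y.set_index("datetime")

# .date on a DatetimeIndex is a property returning an array of datetime.date objects.
dates = indexed.index.date

# Group by those dates and aggregate without an explicit Python loop.
per_day = indexed.groupby(dates)["quantity"].sum()

print(per_day)
```

Built-in aggregations such as `sum` or `mean` tend to be much faster than a Python-level loop on a few hundred thousand rows, which is the point this answer is making.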
