How to create a dask time series dataframe with multiple columns

Asked: 2019-08-22 07:28:47

Tags: dask

I'm having trouble creating a dask time series dataframe that computes the hourly mean of a value, with one output column per name.

Here is a sample of my input csv file:

name,date_time,num
dan,2019-01-02 00:00:00,3
ben,2019-01-02 00:00:00,7
dan,2019-01-02 02:00:00,13
dan,2019-01-02 10:00:00,9
dan,2019-01-02 10:01:00,3
ben,2019-01-02 14:22:00,66
ben,2019-01-02 14:37:00,37

I can produce the desired output with pandas:

import pandas as pd
from matplotlib import pyplot

df = pd.read_csv('my_file.csv')

df['timestamp'] = pd.to_datetime(df.date_time)
df = df.set_index(df.timestamp) # set a datetime index

df = df.groupby('name').resample('H')['num'].mean().unstack('name')

df.fillna(0).plot()

Desired output:

name                  ben   dan
timestamp
2019-01-02 00:00:00   7.0   3.0
2019-01-02 01:00:00   NaN   NaN
2019-01-02 02:00:00   NaN  13.0
2019-01-02 03:00:00   NaN   NaN
2019-01-02 04:00:00   NaN   NaN
2019-01-02 05:00:00   NaN   NaN
2019-01-02 06:00:00   NaN   NaN
2019-01-02 07:00:00   NaN   NaN
2019-01-02 08:00:00   NaN   NaN
2019-01-02 09:00:00   NaN   NaN
2019-01-02 10:00:00   NaN   6.0
2019-01-02 11:00:00   NaN   NaN
2019-01-02 12:00:00   NaN   NaN
2019-01-02 13:00:00   NaN   NaN
2019-01-02 14:00:00  51.5   NaN

I tried to produce the same dataframe with dask:

from dask import dataframe as dd
from matplotlib import pyplot

ddf = dd.read_csv('my_file.csv')

# setting an index
ddf['timestamp'] = dd.to_datetime(ddf.date_time)
ddf = ddf.set_index(ddf.timestamp)
ddf.repartition(freq='MS')

ddf.groupby('name').resample('H')['num'].mean()

When I run the code above, I get this error:

AttributeError: 'Column not found: resample'

This really confuses me; any help would be greatly appreciated.

1 Answer:

Answer 0: (score: 0)

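It looks like dask dataframe does not implement the groupby-resample operation, which is why `resample` is looked up as a column name and fails. It sounds like you have a feature request. I recommend raising an issue at https://github.com/dask/dask/issues/new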

For guidance on where to ask for help, see https://docs.dask.org/en/latest/support.html#asking-for-help