Resampling 'tidy' DataFrames with pandas

Time: 2016-08-24 08:35:26

Tags: python pandas

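I have timestamped data with two columns of interest: a 'label' and a count. I would like to create a time series with the sum per label for, say, each day. Can I use resample to achieve this?

Concrete example: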

import pandas as pd
import numpy as np
from itertools import cycle

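# Hourly timestamps from 2016-01-01 00:00 through 2016-01-07 00:00 (145 rows)
idx = pd.date_range('2016-01-01', '2016-01-07', freq='H')
# One random count in [0, 10) per timestamp
n = np.random.randint(10, size=24*6+1)
# Alternate the labels 'foo' and 'bar' row by row
lst = [(l,c) for l,c in zip(cycle(['foo', 'bar']), n)]
df = pd.DataFrame(lst, index=idx, columns=['label', 'n'])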

df.resample(???).sum()

For this example, the target DataFrame should have a time index and two columns (foo and bar) containing the total count for each interval.

1 answer:

Answer 0 (score: 2)

I think you need groupby with DataFrameGroupBy.resample:

# Select 'n' explicitly so that only the count column is summed per label and day
print (df.groupby('label')['n']
         .resample('1D')
         .sum()
         .reset_index()
         .rename(columns={'level_1':'date'}))

   label       date   n
0    bar 2016-01-01  44
1    bar 2016-01-02  60
2    bar 2016-01-03  65
3    bar 2016-01-04  51
4    bar 2016-01-05  37
5    bar 2016-01-06  59
6    foo 2016-01-01  40
7    foo 2016-01-02  69
8    foo 2016-01-03  58
9    foo 2016-01-04  55
10   foo 2016-01-05  67
11   foo 2016-01-06  59
12   foo 2016-01-07   5
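Because the sample data above is random, the totals are hard to check by eye. The following is a minimal sketch of the same groupby plus resample idea on a small hand-written frame (the name `small` and its values are made up for illustration), where the daily sums can be verified by hand:

```python
import pandas as pd

# Hypothetical hand-written data so the daily totals are easy to verify.
idx = pd.to_datetime([
    '2016-01-01 00:00', '2016-01-01 01:00', '2016-01-01 02:00',
    '2016-01-02 00:00', '2016-01-02 01:00',
])
small = pd.DataFrame(
    {'label': ['foo', 'bar', 'foo', 'bar', 'foo'],
     'n': [1, 2, 3, 4, 5]},
    index=idx,
)

# Resample each label's counts to daily bins and sum them.
# The result is a Series indexed by (label, day).
daily = small.groupby('label')['n'].resample('D').sum()
print(daily)
```

Here foo sums to 1 + 3 on the first day and 5 on the second, while bar sums to 2 and 4.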

Another solution, using stack and unstack to work with the DatetimeIndex:

print (df.set_index('label', append=True)
         .unstack(1)
         .resample('1D')
         .sum()
         .stack()
         .reset_index()
         .rename(columns={'level_0':'date'}))

         date label     n
0  2016-01-01   bar  44.0
1  2016-01-01   foo  40.0
2  2016-01-02   bar  60.0
3  2016-01-02   foo  69.0
4  2016-01-03   bar  65.0
5  2016-01-03   foo  58.0
6  2016-01-04   bar  51.0
7  2016-01-04   foo  55.0
8  2016-01-05   bar  37.0
9  2016-01-05   foo  67.0
10 2016-01-06   bar  59.0
11 2016-01-06   foo  59.0
12 2016-01-07   foo   5.0
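The counts come back as floats here (44.0 rather than 44) because unstack leaves NaN wherever a label has no row at a given timestamp, which forces a float dtype. The sketch below shows this on a small hand-written frame (the name `small` and its values are made up for illustration). It also passes `min_count=1` to `sum`, on the assumption that a day with no rows for a label should stay NaN; in recent pandas versions the default is to return 0 for an all-NaN bin.

```python
import pandas as pd

# Hypothetical hand-written data; 'bar' has no rows on the second day.
idx = pd.to_datetime(['2016-01-01 00:00', '2016-01-01 01:00', '2016-01-02 00:00'])
small = pd.DataFrame({'label': ['foo', 'bar', 'foo'], 'n': [1, 2, 3]}, index=idx)

# Moving 'label' into the columns leaves NaN where a label is absent,
# so the integer counts are converted to float64.
wide = small.set_index('label', append=True).unstack(1)
print(wide.dtypes)

# min_count=1 keeps a bin as NaN when it contains no valid values.
daily = wide.resample('D').sum(min_count=1)
print(daily)
```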

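If you need two columns (the totals below differ from the outputs above, presumably because the random sample data was regenerated):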

df1 = df.set_index('label', append=True).unstack(1).resample('1D').sum()
df1.columns = df1.columns.droplevel(0)
print (df1)
label        bar   foo
2016-01-01  61.0  65.0
2016-01-02  54.0  56.0
2016-01-03  70.0  53.0
2016-01-04  46.0  49.0
2016-01-05  61.0  49.0
2016-01-06  50.0  55.0
2016-01-07   NaN   6.0
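As a variation that is not part of the answer above, the same two-column layout can probably also be reached by grouping on the day and the label together with pd.Grouper and then unstacking the label. This is only a sketch on hand-written data (the name `small` and its values are made up for illustration), and `fill_value=0` reflects the assumption that a missing label on a given day should count as zero rather than NaN:

```python
import pandas as pd

# Hypothetical hand-written data; 'bar' has no rows on the last day.
idx = pd.to_datetime([
    '2016-01-01 00:00', '2016-01-01 01:00', '2016-01-01 02:00',
    '2016-01-02 00:00', '2016-01-02 01:00',
    '2016-01-03 00:00',
])
small = pd.DataFrame(
    {'label': ['foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
     'n': [1, 2, 3, 4, 5, 6]},
    index=idx,
)

# Group by calendar day (taken from the index) and by label, sum the counts,
# then move the labels into columns. Missing combinations are filled with 0.
wide = (small.groupby([pd.Grouper(freq='D'), 'label'])['n']
             .sum()
             .unstack('label', fill_value=0))
print(wide)
```

Since no NaN is introduced, the counts stay integers in this version.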