Suppose I have the following DataFrame:
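import pandas as pd
import numpy as np

# 72 hourly timestamps; recent pandas versions prefer the lowercase alias 'h'
rng = pd.date_range('1/1/2011', periods=72, freq='H')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng, n),
        "cat": np.random.choice(['a', 'b', 'b'], n),
        "val": np.random.randint(0, 5, size=n)
    }
)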
If I now group by cat and datetime:
gb = df.groupby(['cat','datetime']).sum()
I get the total for each cat per hour:
cat  datetime             val
a    2011-01-01 00:00:00    1
     2011-01-01 09:00:00    3
     2011-01-02 16:00:00    1
     2011-01-03 16:00:00    1
b    2011-01-01 08:00:00    4
     2011-01-01 15:00:00    3
     2011-01-01 16:00:00    3
     2011-01-02 04:00:00    4
     2011-01-02 05:00:00    1
     2011-01-02 12:00:00    4
However, I would like to get something like this instead:
cat  datetime    val
a    2011-01-01    4
     2011-01-02    1
     2011-01-03    1
b    2011-01-01   10
     2011-01-02    9
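I can get the desired result by adding another column named date:
df['date'] = df['datetime'].dt.date
and then doing a similar groupby:
df.groupby(['cat', 'date'])['val'].sum()
But I am wondering whether there is a more pythonic way. Also, I may want to aggregate at the month or year level. What is the right way to do this?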
Answer 0: (score: 0)
You can try set_index and then groupby by cat and the date of each index value:
import pandas as pd
import numpy as np
rng = pd.date_range('1/1/2011', periods=72, freq='H')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng, n),
        "cat": np.random.choice(['a', 'b', 'b'], n),
        "val": np.random.randint(0, 5, size=n)
    }
)
print(df)
  cat            datetime  val
0   a 2011-01-01 09:00:00    3
1   b 2011-01-01 15:00:00    3
2   a 2011-01-03 16:00:00    1
3   b 2011-01-02 04:00:00    4
4   b 2011-01-02 05:00:00    1
5   b 2011-01-01 08:00:00    4
6   a 2011-01-01 00:00:00    1
7   a 2011-01-02 16:00:00    1
8   b 2011-01-02 12:00:00    4
9   b 2011-01-01 16:00:00    3
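# Move the timestamps into the index so a callable key can be applied to them
df = df.set_index('datetime')
# The lambda receives each index value (a Timestamp); .date() must be called
gb = df.groupby(['cat', lambda x: x.date()]).sum()
print(gb)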
                val
cat
a   2011-01-01    4
    2011-01-02    1
    2011-01-03    1
b   2011-01-01   10
    2011-01-02    9
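The same idea extends to coarser levels by changing what the callable returns. The following is a minimal sketch on a small hand-written frame (the data and the names by_day, by_month, and by_year are illustrative assumptions, not part of the answer above); it relies only on the fact that a callable in the groupby key list is applied to each index value.

```python
import pandas as pd

# Small illustrative frame (assumed data, not the question's random sample)
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 09:00", "2011-01-01 15:00", "2011-01-01 00:00",
        "2011-01-02 04:00", "2011-02-02 05:00",
    ]),
    "cat": ["a", "b", "a", "b", "b"],
    "val": [3, 3, 1, 4, 1],
}).set_index("datetime")

# Each callable is applied to every index value (a Timestamp)
by_day = df.groupby(["cat", lambda ts: ts.date()])["val"].sum()
by_month = df.groupby(["cat", lambda ts: ts.strftime("%Y-%m")])["val"].sum()
by_year = df.groupby(["cat", lambda ts: ts.year])["val"].sum()

print(by_day)
print(by_month)
print(by_year)
```

Selecting the val column before summing keeps the result to the one numeric column of interest. Formatting the month as a string is only one option; any value that is identical for all timestamps in the same month would work as a key.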
Answer 1: (score: 0)
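On the intermediate (grouped) structure, you can use .unstack to separate the categories into columns, perform a .resample, and then .stack again to return to the original form: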
In [126]: gb = df.groupby(['cat', 'datetime']).sum()

In [127]: gb.unstack(0)
Out[127]:
                     val
cat                    a    b
datetime
2011-01-01 00:00:00  1.0  NaN
2011-01-01 08:00:00  NaN  4.0
2011-01-01 09:00:00  3.0  NaN
2011-01-01 15:00:00  NaN  3.0
2011-01-01 16:00:00  NaN  3.0
2011-01-02 04:00:00  NaN  4.0
2011-01-02 05:00:00  NaN  1.0
2011-01-02 12:00:00  NaN  4.0
2011-01-02 16:00:00  1.0  NaN
2011-01-03 16:00:00  1.0  NaN

In [128]: gb.unstack(0).resample("D").sum().stack()
Out[128]:
                 val
datetime   cat
2011-01-01 a     4.0
           b    10.0
2011-01-02 a     1.0
           b     9.0
2011-01-03 a     1.0
Edit: for other resampling frequencies (month, year, etc.), the pandas resample documentation has a good list of options.
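As an illustration of switching the frequency, here is a sketch of the same unstack / resample / stack pipeline at month level. The sample data and the names hourly, wide, and monthly are assumptions made for this example. It uses the "MS" (month start) alias, passes min_count=1 so that months with no data for a category stay NaN rather than becoming 0, and drops those rows explicitly, since whether .stack drops them on its own can depend on the pandas version.

```python
import pandas as pd

# Small illustrative frame (assumed data) spanning three months
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 09:00", "2011-01-20 15:00", "2011-02-03 16:00",
        "2011-01-02 04:00", "2011-03-02 05:00",
    ]),
    "cat": ["a", "a", "a", "b", "b"],
    "val": [3, 1, 2, 4, 1],
})

# Totals per category and exact timestamp, as in the answer above
hourly = df.groupby(["cat", "datetime"])["val"].sum()

# One column per category, indexed by timestamp
wide = hourly.unstack(0)

# Resample to month-start bins; min_count=1 keeps empty bins as NaN
monthly = wide.resample("MS").sum(min_count=1).stack().dropna()

print(monthly)
```

The result is indexed by (month start, cat), mirroring the shape of the daily output above.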