Pandas: group or resample a DataFrame without losing columns

Time: 2020-03-17 11:15:03

Tags: python pandas dataframe

import pandas as pd
import numpy as np

data = {
    'dateTimeGmt': {0: pd.Timestamp('2020-01-01 06:44:00'),
                    1: pd.Timestamp('2020-01-01 06:45:00'),
                    2: pd.Timestamp('2020-01-01 07:11:00'),
                    3: pd.Timestamp('2020-01-01 07:12:00'),
                    4: pd.Timestamp('2020-01-01 07:12:00'),
                    5: pd.Timestamp('2020-01-01 07:14:00'),
                    6: pd.Timestamp('2020-01-01 10:04:00'),
                    7: pd.Timestamp('2020-01-01 10:04:00'),
                    8: pd.Timestamp('2020-01-01 11:45:00'),
                    9: pd.Timestamp('2020-01-01 06:45:00')},
    'id': {0: 4, 1: 4, 2: 4, 3: 5, 4: 5, 5: 5, 6: 5, 7: 6, 8: 6, 9: 6},
    'name': {0: 'four', 1: 'four', 2: 'four', 3: 'five', 4: 'five',
             5: 'five', 6: 'five', 7: 'six', 8: 'six', 9: 'six'},
    'a': {0: 1.0, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan,
          5: np.nan, 6: np.nan, 7: 5.0, 8: np.nan, 9: np.nan},
    'b': {0: np.nan, 1: 3.0, 2: np.nan, 3: np.nan, 4: np.nan,
          5: np.nan, 6: np.nan, 7: np.nan, 8: np.nan, 9: 3.0},
    'c': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: 2.0,
          5: np.nan, 6: np.nan, 7: np.nan, 8: 0.0, 9: np.nan},
}
df = pd.DataFrame(data)

I want to flatten the DataFrame so that all columns after name are grouped by the hour in dateTimeGmt, and then by id / name.

I tried df2 = df.groupby([df.dateTimeGmt.dt.date, df.dateTimeGmt.dt.hour, df.id, df.name]).sum(), which seems to work, but it moves all of the grouping columns into the index.

df3 = df.groupby([df.dateTimeGmt.dt.date, df.dateTimeGmt.dt.hour, df.id, df.name], as_index = False).sum() keeps id and name, but the dateTimeGmt data is lost.

How can I group the data without losing the columns I am grouping by?

1 Answer:

Answer 0: (score: 3)

In your solution you need to rename the date and hour Series so that the resulting column names are not duplicated, and then add DataFrame.reset_index at the end:

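# numeric_only=True keeps the leftover dateTimeGmt column out of the sum;
# pandas 2.0+ no longer drops non-numeric columns silently
df2 = (df.groupby([df.dateTimeGmt.dt.date.rename('date'),
                   df.dateTimeGmt.dt.hour.rename('h'), 'id', 'name'])
         .sum(numeric_only=True)
         .reset_index())
print(df2)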
         date   h  id  name    a    b    c
0  2020-01-01   6   4  four  1.0  3.0  0.0
1  2020-01-01   6   6   six  0.0  3.0  0.0
2  2020-01-01   7   4  four  0.0  0.0  0.0
3  2020-01-01   7   5  five  0.0  0.0  2.0
4  2020-01-01  10   5  five  0.0  0.0  0.0
5  2020-01-01  10   6   six  5.0  0.0  0.0
6  2020-01-01  11   6   six  0.0  0.0  0.0
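To see why the rename step matters, here is a minimal sketch (the two-row Series ts is made up for illustration and is not the question's data). Series derived through the .dt accessor keep the name of the parent column, so both grouping keys would otherwise be called dateTimeGmt and end up as identically named index levels.

```python
import pandas as pd

# Hypothetical two-row Series standing in for df.dateTimeGmt
ts = pd.Series(pd.to_datetime(['2020-01-01 06:44:00', '2020-01-01 07:11:00']),
               name='dateTimeGmt')

# Both derived Series inherit the parent name
date_key = ts.dt.date
hour_key = ts.dt.hour
print(date_key.name, hour_key.name)

# After rename, each key has its own distinct name
renamed = [ts.dt.date.rename('date'), ts.dt.hour.rename('h')]
print([s.name for s in renamed])
```

With distinct names, reset_index can turn every index level into its own column.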

Alternatively, use Grouper with an hourly frequency:

# 'h' is the hourly alias; the uppercase 'H' spelling is deprecated since pandas 2.2
df2 = (df.groupby([pd.Grouper(freq='h', key='dateTimeGmt'), 'id', 'name'])
         .sum()
         .reset_index())
print(df2)
          dateTimeGmt  id  name    a    b    c
0 2020-01-01 06:00:00   4  four  1.0  3.0  0.0
1 2020-01-01 06:00:00   6   six  0.0  3.0  0.0
2 2020-01-01 07:00:00   4  four  0.0  0.0  0.0
3 2020-01-01 07:00:00   5  five  0.0  0.0  2.0
4 2020-01-01 10:00:00   5  five  0.0  0.0  0.0
5 2020-01-01 10:00:00   6   six  5.0  0.0  0.0
6 2020-01-01 11:00:00   6   six  0.0  0.0  0.0
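Another way to get the same hourly buckets, sketched here on a small made-up frame (small, floored, out and out_nan are illustrative names, not part of the original answer), is to floor the timestamps to the hour with Series.dt.floor and then group by ordinary column names. The sketch also shows min_count=1, which may be useful if the 0.0 values for all-NaN groups are not wanted: with it, a group with no non-NaN values sums to NaN instead.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame in the same shape as the question's data
small = pd.DataFrame({
    'dateTimeGmt': pd.to_datetime(['2020-01-01 06:44:00',
                                   '2020-01-01 06:45:00',
                                   '2020-01-01 07:11:00']),
    'id': [4, 4, 4],
    'name': ['four', 'four', 'four'],
    'a': [1.0, np.nan, np.nan],
    'b': [np.nan, 3.0, np.nan],
})

# Round every timestamp down to the start of its hour
floored = small.assign(dateTimeGmt=small['dateTimeGmt'].dt.floor('h'))

# Grouping by column names with as_index=False keeps the keys as columns
keys = ['dateTimeGmt', 'id', 'name']
out = floored.groupby(keys, as_index=False).sum()

# min_count=1 gives NaN, not 0.0, for groups without any non-NaN value
out_nan = floored.groupby(keys, as_index=False).sum(min_count=1)

print(out)
print(out_nan)
```

Because the grouping keys are real columns here, no rename or reset_index step is needed.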