Pandas: average over different time intervals and group by columns

Time: 2021-03-20 09:16:30

Tags: python pandas

I have this pandas dataset:

date      a  b  c  d  col1  col2
01-12-20  0  0  0  0  1     0.2
02-12-20  0  0  0  0  2     0.4
03-12-20  0  0  0  0  5     5
03-12-20  1  1  0  0  6     1
03-12-20  0  1  0  0  5     1
05-12-20  0  0  0  0  2     0.3
07-12-20  1  0  1  0  3     6
08-12-20  0  1  0  1  8     7
08-12-20  0  0  1  0  7     1
10-12-20  0  0  0  0  4     1
11-12-20  0  0  0  1  3     0

and the following list of time intervals:

from datetime import date
timeintervals = [date(2020, 12, 1), date(2020, 12, 3), date(2020, 12, 7), date(2020, 12, 11), date(2020, 12, 21)]
groupcol = ["a","b","c","d"]
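
To make the example reproducible, the table above can be loaded into a DataFrame roughly as follows. This is only a sketch: the column name `date` is an assumption (it matches what the answer below uses), and the dates are read as day-month-year.

```python
import pandas as pd

# Rows copied from the table in the question (dates are day-month-year).
rows = [
    ("01-12-20", 0, 0, 0, 0, 1, 0.2),
    ("02-12-20", 0, 0, 0, 0, 2, 0.4),
    ("03-12-20", 0, 0, 0, 0, 5, 5),
    ("03-12-20", 1, 1, 0, 0, 6, 1),
    ("03-12-20", 0, 1, 0, 0, 5, 1),
    ("05-12-20", 0, 0, 0, 0, 2, 0.3),
    ("07-12-20", 1, 0, 1, 0, 3, 6),
    ("08-12-20", 0, 1, 0, 1, 8, 7),
    ("08-12-20", 0, 0, 1, 0, 7, 1),
    ("10-12-20", 0, 0, 0, 0, 4, 1),
    ("11-12-20", 0, 0, 0, 1, 3, 0),
]
df = pd.DataFrame(rows, columns=["date", "a", "b", "c", "d", "col1", "col2"])

# Parse the strings into real datetimes so they can be compared with the bounds.
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%y")
```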

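I need to get a new dataset in which:

  • col1 and col2 are averaged, grouped by time interval and by the columns listed in the groupcol variable (counting only the rows where the column is 1, not 0). In other words, I want the average for "a" in each time interval, taken over the rows where "a" is 1. If "a" is all zeros within an interval, the resulting average should be zero.
  • the "abcd" columns may have any names;
  • the time intervals are defined from the provided list so that timeintervals[i] <= date < timeintervals[i+1], and so on for each consecutive pair of bounds;
  • each time interval in the grouped df is labelled with its upper bound minus one day.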
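
The interval rule in the last two bullets can be sketched for a single date as below. The helper name `interval_label` is hypothetical (it is not part of the question), and the sketch assumes the list of bounds is sorted in ascending order.

```python
from bisect import bisect_right
from datetime import date, timedelta

timeintervals = [date(2020, 12, 1), date(2020, 12, 3), date(2020, 12, 7),
                 date(2020, 12, 11), date(2020, 12, 21)]

def interval_label(d, bounds):
    """Return the label (upper bound minus one day) of the interval containing d."""
    # i is the index with bounds[i] <= d < bounds[i + 1]
    i = bisect_right(bounds, d) - 1
    if i < 0 or i + 1 >= len(bounds):
        raise ValueError(f"{d} is outside the provided intervals")
    return bounds[i + 1] - timedelta(days=1)

print(interval_label(date(2020, 12, 3), timeintervals))   # 2020-12-06
print(interval_label(date(2020, 12, 11), timeintervals))  # 2020-12-20
```

A date equal to a bound falls into the interval that starts at that bound, which is why 03-12-20 is labelled 06-12-20 rather than 02-12-20.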

The result should look like this:

date      abcd  col1  col2
02-12-20  a     0     0
02-12-20  b     0     0
02-12-20  c     0     0
02-12-20  d     0     0
06-12-20  a     6     1
06-12-20  b     5.5   1
06-12-20  c     0     0
06-12-20  d     0     0
10-12-20  a     3     6
10-12-20  b     8     7
10-12-20  c     5     3.5
10-12-20  d     8     7
20-12-20  a     0     0
20-12-20  b     0     0
20-12-20  c     0     0
20-12-20  d     3     0

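I could not find a working solution for this.

1 answer:

Answer 0: (score: 1)

Let's go step by step.

First we add a column that specifies which interval group each date is assigned to: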

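# df['date'] must hold datetimes; dgr = position of the first bound strictly greater than the date, minus one
df['dgr'] = df['date'].apply(lambda d: [d.date()<l for l in timeintervals].index(True)-1)
df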

to get

    date                   a    b    c    d    col1    col2    dgr
--  -------------------  ---  ---  ---  ---  ------  ------  -----
 0  2020-12-01 00:00:00    0    0    0    0       1     0.2      0
 1  2020-12-02 00:00:00    0    0    0    0       2     0.4      0
 2  2020-12-03 00:00:00    0    0    0    0       5     5        1
 3  2020-12-03 00:00:00    1    1    0    0       6     1        1
 4  2020-12-03 00:00:00    0    1    0    0       5     1        1
 5  2020-12-05 00:00:00    0    0    0    0       2     0.3      1
 6  2020-12-07 00:00:00    1    0    1    0       3     6        2
 7  2020-12-08 00:00:00    0    1    0    1       8     7        2
 8  2020-12-08 00:00:00    0    0    1    0       7     1        2
 9  2020-12-10 00:00:00    0    0    0    0       4     1        2
10  2020-12-11 00:00:00    0    0    0    1       3     0        3
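
The lambda above scans the whole list of bounds for every row. A vectorised sketch of the same assignment, assuming the bounds are sorted, can use NumPy's `searchsorted`. The variable names `dates` and `bounds` are illustrative only. One behavioural difference: a date on or after the last bound gets index `len(timeintervals) - 1` here, whereas the lambda raises a `ValueError`.

```python
from datetime import date

import numpy as np
import pandas as pd

timeintervals = [date(2020, 12, 1), date(2020, 12, 3), date(2020, 12, 7),
                 date(2020, 12, 11), date(2020, 12, 21)]

# A few of the dates from the question, as a datetime Series.
dates = pd.Series(pd.to_datetime(
    ["2020-12-01", "2020-12-03", "2020-12-05", "2020-12-07", "2020-12-11"]))

# Convert the bounds to datetime64 so they are comparable with the Series values.
bounds = pd.to_datetime(timeintervals).to_numpy()

# side="right" counts the bounds <= date, so subtracting 1 gives the
# index i with bounds[i] <= date < bounds[i + 1].
dgr = np.searchsorted(bounds, dates.to_numpy(), side="right") - 1
print(dgr.tolist())  # [0, 1, 1, 2, 3]
```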

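The following command prepares the "skeleton" of the final result: it unpivots a, b, c, d into a single column and keeps track of the 0/1 flags from the original df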

df2 = (df[groupcol]
    .stack()
    .reset_index(level = 1)
    .rename(columns = {'level_1' : 'abcd',0:'incl'})
)
df2

to get

    abcd      incl
--  ------  ------
 0  a            0
 0  b            0
 0  c            0
 0  d            0
 1  a            0
 1  b            0
 1  c            0
 1  d            0
 2  a            0
 2  b            0
 2  c            0
 2  d            0
 3  a            1
 3  b            1
 3  c            0
 3  d            0
 4  a            0
 4  b            1
 4  c            0
 4  d            0
 5  a            0
 5  b            0
 5  c            0
 5  d            0
 6  a            1
 6  b            0
 6  c            1
 6  d            0
 7  a            0
 7  b            1
 7  c            0
 7  d            1
 8  a            0
 8  b            0
 8  c            1
 8  d            0
 9  a            0
 9  b            0
 9  c            0
 9  d            0
10  a            0
10  b            0
10  c            0
10  d            1

Now we can join df2 with df to fill in the date, col1 and col2 details

df2 = df2.join(df[['date','col1','col2','dgr']])

to get

    abcd      incl  date                   col1    col2    dgr
--  ------  ------  -------------------  ------  ------  -----
 0  a            0  2020-12-01 00:00:00       1     0.2      0
 0  b            0  2020-12-01 00:00:00       1     0.2      0
 0  c            0  2020-12-01 00:00:00       1     0.2      0
 0  d            0  2020-12-01 00:00:00       1     0.2      0
 1  a            0  2020-12-02 00:00:00       2     0.4      0
 1  b            0  2020-12-02 00:00:00       2     0.4      0
 1  c            0  2020-12-02 00:00:00       2     0.4      0
 1  d            0  2020-12-02 00:00:00       2     0.4      0
 2  a            0  2020-12-03 00:00:00       5     5        1
 2  b            0  2020-12-03 00:00:00       5     5        1
 2  c            0  2020-12-03 00:00:00       5     5        1
 2  d            0  2020-12-03 00:00:00       5     5        1
 3  a            1  2020-12-03 00:00:00       6     1        1
 3  b            1  2020-12-03 00:00:00       6     1        1
 3  c            0  2020-12-03 00:00:00       6     1        1
 3  d            0  2020-12-03 00:00:00       6     1        1
 4  a            0  2020-12-03 00:00:00       5     1        1
 4  b            1  2020-12-03 00:00:00       5     1        1
 4  c            0  2020-12-03 00:00:00       5     1        1
 4  d            0  2020-12-03 00:00:00       5     1        1
 5  a            0  2020-12-05 00:00:00       2     0.3      1
 5  b            0  2020-12-05 00:00:00       2     0.3      1
 5  c            0  2020-12-05 00:00:00       2     0.3      1
 5  d            0  2020-12-05 00:00:00       2     0.3      1
 6  a            1  2020-12-07 00:00:00       3     6        2
 6  b            0  2020-12-07 00:00:00       3     6        2
 6  c            1  2020-12-07 00:00:00       3     6        2
 6  d            0  2020-12-07 00:00:00       3     6        2
 7  a            0  2020-12-08 00:00:00       8     7        2
 7  b            1  2020-12-08 00:00:00       8     7        2
 7  c            0  2020-12-08 00:00:00       8     7        2
 7  d            1  2020-12-08 00:00:00       8     7        2
 8  a            0  2020-12-08 00:00:00       7     1        2
 8  b            0  2020-12-08 00:00:00       7     1        2
 8  c            1  2020-12-08 00:00:00       7     1        2
 8  d            0  2020-12-08 00:00:00       7     1        2
 9  a            0  2020-12-10 00:00:00       4     1        2
 9  b            0  2020-12-10 00:00:00       4     1        2
 9  c            0  2020-12-10 00:00:00       4     1        2
 9  d            0  2020-12-10 00:00:00       4     1        2
10  a            0  2020-12-11 00:00:00       3     0        3
10  b            0  2020-12-11 00:00:00       3     0        3
10  c            0  2020-12-11 00:00:00       3     0        3
10  d            1  2020-12-11 00:00:00       3     0        3
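
As a side note, the stack-then-join sequence can probably be collapsed into a single `melt` call. The sketch below uses a tiny made-up frame rather than the question's data, and the name `long` is illustrative; it is an alternative formulation, not the approach used in this answer.

```python
import pandas as pd

groupcol = ["a", "b", "c", "d"]

# Tiny made-up frame, only to show the reshaping.
df = pd.DataFrame({"a": [0, 1], "b": [1, 1], "c": [0, 0], "d": [0, 1],
                   "col1": [5, 6], "col2": [1.0, 2.0]})

long = (df.reset_index()                      # keep the original row number
          .melt(id_vars=["index", "col1", "col2"], value_vars=groupcol,
                var_name="abcd", value_name="incl")
          .sort_values(["index", "abcd"])     # same row order as stack()
          .set_index("index"))
print(long)
```

Each original row appears once per flag column, with its col1 and col2 values repeated, which is the same shape as the joined df2 above.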

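Now we can aggregate by dgr and abcd and label the groups as required. Multiplying col1 and col2 by incl zeroes out the rows whose flag is 0, so the group sum divided by the number of flagged rows (the sum of incl) is the mean over the flagged rows only; a group with no flagged rows gives 0/0 = NaN, which is replaced by 0.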

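from datetime import timedelta
# zero out col1/col2 on rows whose flag is 0
df2[['col1','col2']] = df2[['col1','col2']].mul(df2['incl'],axis=0)
# sum only the numeric columns (the datetime 'date' column cannot be summed)
df3 = df2.groupby(['dgr','abcd'])[['incl','col1','col2']].sum().reset_index()
# sum / number of flagged rows = mean; 0/0 gives NaN, which becomes 0
df3[['col1','col2']] = df3[['col1','col2']].div(df3['incl'],axis=0).fillna(0)
df3['date'] = df3['dgr'].map(lambda g: timeintervals[g+1] + timedelta(days = -1))
df3.drop(columns = ['incl','dgr'])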

for the final output

    abcd      col1    col2  date
--  ------  ------  ------  ----------
 0  a          0       0    2020-12-02
 1  b          0       0    2020-12-02
 2  c          0       0    2020-12-02
 3  d          0       0    2020-12-02
 4  a          6       1    2020-12-06
 5  b          5.5     1    2020-12-06
 6  c          0       0    2020-12-06
 7  d          0       0    2020-12-06
 8  a          3       6    2020-12-10
 9  b          8       7    2020-12-10
10  c          5       3.5  2020-12-10
11  d          8       7    2020-12-10
12  a          0       0    2020-12-20
13  b          0       0    2020-12-20
14  c          0       0    2020-12-20
15  d          3       0    2020-12-20
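
For reference, the whole pipeline can be wrapped into one function along the following lines. This is a sketch under assumptions, not a drop-in replacement: the function name `interval_means` and its parameters are hypothetical, the bounds are assumed sorted, and rows whose date falls outside the bounds are silently dropped. It filters on the flag and takes a plain mean instead of the multiply, sum and divide steps.

```python
from datetime import date

import numpy as np
import pandas as pd

def interval_means(df, timeintervals, groupcol, valuecols, datecol="date"):
    """Mean of valuecols per (interval, flag column), using only rows where the flag is 1."""
    bounds = pd.to_datetime(timeintervals)
    # index i such that bounds[i] <= date < bounds[i + 1]
    dgr = np.searchsorted(bounds.to_numpy(), df[datecol].to_numpy(), side="right") - 1
    long = (df.assign(dgr=dgr)
              .melt(id_vars=["dgr"] + valuecols, value_vars=groupcol,
                    var_name="abcd", value_name="incl"))
    # keep only flagged rows, so a plain mean is the wanted average
    means = long[long["incl"] == 1].groupby(["dgr", "abcd"])[valuecols].mean()
    # reindex over every (interval, flag) pair so that empty groups become 0
    full = pd.MultiIndex.from_product([range(len(bounds) - 1), groupcol],
                                      names=["dgr", "abcd"])
    out = means.reindex(full, fill_value=0).reset_index()
    # label each interval with its upper bound minus one day
    out[datecol] = out["dgr"].map(lambda g: bounds[g + 1] - pd.Timedelta(days=1))
    return out[[datecol, "abcd"] + valuecols]

# The data from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-01", "2020-12-02", "2020-12-03", "2020-12-03",
                            "2020-12-03", "2020-12-05", "2020-12-07", "2020-12-08",
                            "2020-12-08", "2020-12-10", "2020-12-11"]),
    "a": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    "b": [0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    "c": [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    "d": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    "col1": [1, 2, 5, 6, 5, 2, 3, 8, 7, 4, 3],
    "col2": [0.2, 0.4, 5, 1, 1, 0.3, 6, 7, 1, 1, 0],
})
timeintervals = [date(2020, 12, 1), date(2020, 12, 3), date(2020, 12, 7),
                 date(2020, 12, 11), date(2020, 12, 21)]
groupcol = ["a", "b", "c", "d"]

result = interval_means(df, timeintervals, groupcol, ["col1", "col2"])
print(result)
```

Because the flag column names are passed in as a list, the function does not depend on them being called a, b, c and d.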