Pandas: map a groupby over multiple columns in the same DataFrame

Asked: 2019-03-19 11:56:18

Tags: python pandas

I have a DataFrame df and I need to group by multiple columns based on a condition.

df

user_id  area_id  group_id  key  year  value  new
10835    48299    1         5    2011  0      ?
10835    48299    1         2    2010  0
10835    48299    2         102  2013  13100
10835    48299    2         5    2016  0
10836    48299    1         78   2017  67100
10836    48299    1         1    2012  54000
10836    48299    1         12   2018  0
10836    48752    1         7    2014  0
10836    48752    2         103  2015  5000
10837    48752    2         102  2016  5000
10837    48752    1         3    2017  0
10837    48752    1         103  2017  0
10837    49226    1         2    2011  4000
10837    49226    1         83   2011  4000
10838    49226    2         16   2011  0
10838    49226    1         75   2012  0
10838    49226    1         2    2012  4000
10838    49226    1         12   2013  1000
10839    49226    1         3    2015  6500
10839    49226    1         102  2016  7900
10839    49226    1         16   2017  0
10839    49226    2         6    2017  5500
22489    49226    2         89   2017  5000
22489    49226    1         102  2017  5000

My goal is to create a new column df['new']. Current solution:

df['new'] = df['user_id'].map(df[df['key'].eq(102)].groupby(['user_id', 'area_id', 'group_id', 'year'])['value'].sum())

I get NaN for every value in df['new']. I guess it is not possible to map a groupby over multiple columns this way with the map function. Is there a proper way to achieve this? Thanks in advance for pointing me in the right direction.
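
For reference, here is a minimal check showing why everything comes back NaN: the grouped Series is keyed by a MultiIndex of all four columns, so the plain user_id values passed to map never match any of its keys (same column names as above):

s = df[df['key'].eq(102)].groupby(['user_id', 'area_id', 'group_id', 'year'])['value'].sum()
print(s.index.nlevels)                    # 4 - keys are (user_id, area_id, group_id, year) tuples
print(df['user_id'].map(s).isna().all())  # True - scalar user_id values never match those tuples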

1 Answer:

Answer 0 (score: 1)

You can add as_index=False to get a new DataFrame:

df1 = (df[df['key'].eq(102)]
             .groupby(['user_id', 'area_id', 'group_id', 'year'], as_index=False)['value']
             .sum())
print (df1)
   user_id  area_id  group_id  year  value
0    10835    48299         2  2013  13100
1    10837    48752         2  2016   5000
2    10839    49226         1  2016   7900
3    22489    49226         1  2017   5000
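
For reference, the same df1 can also be built without as_index=False by resetting the index of the grouped Series afterwards (an equivalent sketch, not the exact code from this answer):

df1_alt = (df[df['key'].eq(102)]
               .groupby(['user_id', 'area_id', 'group_id', 'year'])['value']
               .sum()
               .reset_index())
# df1_alt has the same columns and rows as df1 above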

Then, if duplicated user_id values are possible, first keep only unique rows with DataFrame.drop_duplicates, and build the Series for map with DataFrame.set_index:

df['new'] = df['user_id'].map(df1.drop_duplicates('user_id').set_index('user_id')['value'])
#if never duplicates
#df['new'] = df['user_id'].map(df1.set_index('user_id')['value'])
print (df)
    user_id  area_id  group_id  key  year  value      new
0     10835    48299         1    5  2011      0  13100.0
1     10835    48299         1    2  2010      0  13100.0
2     10835    48299         2  102  2013  13100  13100.0
3     10835    48299         2    5  2016      0  13100.0
4     10836    48299         1   78  2017  67100      NaN
5     10836    48299         1    1  2012  54000      NaN
6     10836    48299         1   12  2018      0      NaN
7     10836    48752         1    7  2014      0      NaN
8     10836    48752         2  103  2015   5000      NaN
9     10837    48752         2  102  2016   5000   5000.0
10    10837    48752         1    3  2017      0   5000.0
11    10837    48752         1  103  2017      0   5000.0
12    10837    49226         1    2  2011   4000   5000.0
13    10837    49226         1   83  2011   4000   5000.0
14    10838    49226         2   16  2011      0      NaN
15    10838    49226         1   75  2012      0      NaN
16    10838    49226         1    2  2012   4000      NaN
17    10838    49226         1   12  2013   1000      NaN
18    10839    49226         1    3  2015   6500   7900.0
19    10839    49226         1  102  2016   7900   7900.0
20    10839    49226         1   16  2017      0   7900.0
21    10839    49226         2    6  2017   5500   7900.0
22    22489    49226         2   89  2017   5000   5000.0
23    22489    49226         1  102  2017   5000   5000.0
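
If instead the new value should only be filled for rows whose user_id, area_id, group_id and year all match a key 102 row (rather than being propagated to every row of that user_id, as above), a left merge is one possible alternative; the names df2 and new_exact below are only illustrative:

df2 = df.merge(df1.rename(columns={'value': 'new_exact'}),
               on=['user_id', 'area_id', 'group_id', 'year'],
               how='left')
# new_exact is NaN everywhere except on rows that match an aggregated key exactly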