Column aggregation filtered by row value with a pandas DataFrame

Time: 2015-06-12 01:35:07

Tags: python pandas

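Is there a better (faster) way to do this?

For each person on a given day, I want to find the total sales made that day at the same place that person was at: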

    day     name    sold    place
0   mon     Ben     2       1
1   mon     Amy     6       0
2   mon     Sue     7       1
3   mon     John    9       0
4   tues    Ben     9       1
5   tues    Amy     4       0
6   tues    Sue     10      1
7   tues    John    5       0
8   wed     Ben     8       0
9   wed     Amy     3       0
10  wed     Sue     10      1
11  wed     John    3       0

The result should look like this:

    day     name    sold    place   sold_at_same_place
0   mon     Ben     2       1       9
1   mon     Amy     6       0       15
2   mon     Sue     7       1       9
3   mon     John    9       0       15
4   tues    Ben     9       1       19
5   tues    Amy     4       0       9
6   tues    Sue     10      1       19
7   tues    John    5       0       9
8   wed     Ben     8       0       14
9   wed     Amy     3       0       14
10  wed     Sue     10      1       10
11  wed     John    3       0       14

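In case it's unclear: the total sold at place 1 on Monday is 2 + 7 = 9. Since Ben was at place 1, his sold_at_same_place is 9. Amy's sold_at_same_place for Monday is 15 (6 + 9), because she was at place 0.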
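To make the Monday arithmetic reproducible, here is a minimal sketch that rebuilds the sample table by hand and sums sold per place for Monday only. The variable names `monday` and `monday_totals` are just for illustration.

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "day": ["mon"] * 4 + ["tues"] * 4 + ["wed"] * 4,
    "name": ["Ben", "Amy", "Sue", "John"] * 3,
    "sold": [2, 6, 7, 9, 9, 4, 10, 5, 8, 3, 10, 3],
    "place": [1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Keep Monday rows only, then total `sold` for each place.
monday = df[df["day"] == "mon"]
monday_totals = monday.groupby("place")["sold"].sum()
print(monday_totals)
```

Place 1 comes out as 2 + 7 and place 0 as 6 + 9, which are the values Ben and Amy should get for Monday.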

Here is what I came up with:

  1. Get the daily totals for each place value:

    import pandas as pd

    def sold_by_day_filter(df, col_name, field_value):
        """
        sums sold by day
        filtering the `col_name` on `field_value`
        """
        subset = pd.DataFrame(df[df[col_name] == field_value])
    
        aggregated_subset = pd.DataFrame(
            {str(field_value): subset.groupby(['day'])['sold'].sum()}
        ).reset_index()
    
        return aggregated_subset
    
  2. Join each of these totals back onto the original dataset:

    for val in df['place'].unique():
        df = pd.merge(df, sold_by_day_filter(df,'place', val), on='day')
    

    Now the dataset looks like this:

        day     name    sold    place   1   0   
    0   mon     Ben     2       1       9   15  
    1   mon     Amy     6       0       9   15  
    2   mon     Sue     7       1       9   15  
    3   mon     John    9       0       9   15  
    4   tues    Ben     9       1       19  9   
    5   tues    Amy     4       0       19  9   
    6   tues    Sue     10      1       19  9   
    7   tues    John    5       0       19  9   
    8   wed     Ben     8       0       10  14  
    9   wed     Amy     3       0       10  14  
    10  wed     Sue     10      1       10  14  
    11  wed     John    3       0       10  14
    
  3. Fill sold_at_same_place from the temporary column that matches the value in each row's place column:

    df['sold_at_same_place'] = \
        df.apply( lambda row: row[str(row['place'])], axis = 1)
    
  4. Drop the temporary columns ('1' and '0'):

    fields_to_drop = [str(field) for field in df['place'].unique()]
    df.drop(fields_to_drop, axis=1, inplace=True)
    
So this works, but I feel like there is probably a simpler way to do it with pandas. Any suggestions are appreciated!
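For anyone who wants to reproduce the approach, the four steps above can be assembled into one script. This is only a sketch: the sample data is rebuilt by hand, and the wrapper name `add_sold_at_same_place` is hypothetical, not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["mon"] * 4 + ["tues"] * 4 + ["wed"] * 4,
    "name": ["Ben", "Amy", "Sue", "John"] * 3,
    "sold": [2, 6, 7, 9, 9, 4, 10, 5, 8, 3, 10, 3],
    "place": [1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0],
})

def sold_by_day_filter(df, col_name, field_value):
    """Step 1: sum `sold` per day for rows where `col_name` == `field_value`."""
    subset = df[df[col_name] == field_value]
    # The total column is named after the place value, e.g. '1' or '0'.
    return (
        subset.groupby("day")["sold"].sum()
        .rename(str(field_value))
        .reset_index()
    )

def add_sold_at_same_place(df):
    """Hypothetical wrapper that runs steps 2 to 4 on a copy of `df`."""
    places = df["place"].unique()
    out = df.copy()
    # Step 2: merge one temporary total column per place value.
    for val in places:
        out = pd.merge(out, sold_by_day_filter(df, "place", val), on="day")
    # Step 3: pick the temporary column matching each row's place.
    out["sold_at_same_place"] = out.apply(
        lambda row: row[str(row["place"])], axis=1
    )
    # Step 4: drop the temporary columns.
    return out.drop(columns=[str(p) for p in places])

result = add_sold_at_same_place(df)
print(result)
```

Note that this does one merge per distinct place and a row-wise apply, which is the part that gets slow on larger data.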

1 Answer:

Answer 0: (score: 3)

I think this is a one-liner using transform:

>>> df["sold_at_same_place"] = df.groupby(["day", "place"])["sold"].transform("sum")
>>> df
     day  name  sold  place  sold_at_same_place
0    mon   Ben     2      1                   9
1    mon   Amy     6      0                  15
2    mon   Sue     7      1                   9
3    mon  John     9      0                  15
4   tues   Ben     9      1                  19
5   tues   Amy     4      0                   9
6   tues   Sue    10      1                  19
7   tues  John     5      0                   9
8    wed   Ben     8      0                  14
9    wed   Amy     3      0                  14
10   wed   Sue    10      1                  10
11   wed  John     3      0                  14

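transform takes the per-group result of the groupby and broadcasts it back to the original index, so every row gets the total for its own (day, place) group.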
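To see what "broadcasts back to the original index" means in practice, here is a small sketch comparing a plain group sum with transform on the same data. The sample frame is rebuilt by hand, the aggregation is passed as the string "sum" (a name pandas accepts), and the variable names `totals`, `broadcast` and `merged` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["mon"] * 4 + ["tues"] * 4 + ["wed"] * 4,
    "name": ["Ben", "Amy", "Sue", "John"] * 3,
    "sold": [2, 6, 7, 9, 9, 4, 10, 5, 8, 3, 10, 3],
    "place": [1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0],
})

grouped = df.groupby(["day", "place"])["sold"]

# Plain aggregation: one row per (day, place) group.
totals = grouped.sum()

# transform: one value per original row, aligned to df's index.
broadcast = grouped.transform("sum")
print(len(totals), len(broadcast))

# The same column built the long way, by merging the group totals back
# on both keys; a left merge keeps the original row order.
merged = df.merge(
    totals.rename("sold_at_same_place").reset_index(),
    on=["day", "place"],
    how="left",
)

df["sold_at_same_place"] = broadcast
print(df)
```

The merge version is essentially what the question's steps do by hand; transform just skips the intermediate table.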