Question

Good evening all,

My code does what I need, but I am curious why it works.

import numpy as np
import pandas as pd

dft2 = pd.DataFrame(
    np.array([
        ['1', 'A', 'WW'], ['1', 'B', 'XX'], ['3', 'A', 'LL'],
        ['1', 'D', 'ZZ'], ['2', 'A', 'LL'], ['3', 'E', 'LL'],
    ]),
    columns=['channel', 'state', 'rbc_security_type1'],
)
display(dft2)


    channel state   rbc_security_type1
0   1         A          WW
1   1         B          XX
2   3         A          LL
3   1         D          ZZ
4   2         A          LL
5   3         E          LL

d = {
        ('state',np.size),
        ('rbc_security_type1',np.size)   
    }

dft2_Grp = dft2.groupby('channel')['state'].agg(d).reset_index() 
dft2_Grp = dft2.groupby('channel')['rbc_security_type1'].agg(d).reset_index() 

dft2_Grp = dft2_Grp.rename(columns={'state':'State_Count', 'rbc_security_type1':'rbc_security_type1_Count'}, level=0) # rename the column header in the groupby
display(dft2_Grp)

Both of these aggregations produce the same output, and I would like to know why:

dft2_Grp = dft2.groupby('channel')['state'].agg(d).reset_index() 
dft2_Grp = dft2.groupby('channel')['rbc_security_type1'].agg(d).reset_index() 


        channel State_Count rbc_security_type1_Count
    0   1           3               3
    1   2           1               1
    2   3           2               2

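What exactly happens when we call .groupby('channel')[column].agg(d) while applying a count to multiple columns? The aggregation d of ('state', np.size), ('rbc_security_type1', np.size) makes sense to me, but why does only one [column] need to be selected when aggregating with d, given that there are already two columns I want to count? Why are both columns not required?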

If I omit the [], which I thought would make sense, and use the following command:

dft2_Grp = dft2.groupby('channel').agg(d).reset_index()

The output follows:
channel     State_Count                 rbc_security_type1_Count
            state   rbc_security_type1  state   rbc_security_type1
0   1       3           3               3           3
1   2       1           1               1           1
2   3       2           2               2           2

Peter

Answer 1

You get the same output because the same function is called twice, and that function returns the count of values in each group.

d = {
        ('state',np.size),
        ('rbc_security_type1',np.size)   
    }

dft2_Grp = dft2.groupby('channel')['state'].agg(d).reset_index()

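For the state column, grouped by channel, this returns two new columns named state and rbc_security_type1, both computed with the same aggregate function np.size. The first element of each tuple is only the name of the output column; it does not select an input column.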

dft2_Grp = dft2.groupby('channel')['rbc_security_type1'].agg(d).reset_index()

Likewise, for the rbc_security_type1 column, grouped by channel, this returns two new columns named state and rbc_security_type1, both computed with the same aggregate function np.size.
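A minimal sketch of the point above, assuming a pandas version that accepts a list of (name, function) tuples on a selected column. Two things differ from the question and are assumptions made for a predictable, self-contained example: a list is used instead of the braces (which build a set, so the column order is not guaranteed), and the string 'size' is used instead of np.size. The names spec, by_state and by_type are illustrative only.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    [['1', 'A', 'WW'], ['1', 'B', 'XX'], ['3', 'A', 'LL'],
     ['1', 'D', 'ZZ'], ['2', 'A', 'LL'], ['3', 'E', 'LL']],
    columns=['channel', 'state', 'rbc_security_type1'],
)

# Each tuple is (output column name, function). The names are arbitrary
# labels; they do not pick an input column.
spec = [('state', 'size'), ('rbc_security_type1', 'size')]

by_state = df.groupby('channel')['state'].agg(spec)
by_type = df.groupby('channel')['rbc_security_type1'].agg(spec)

# Both results contain the group sizes twice, whichever column was selected.
same = by_state.equals(by_type)
group_sizes = df.groupby('channel').size()

print(by_state)
print(same)
print(group_sizes)
```

Because a group's size does not depend on which column is selected, the two results compare equal, which is why the question's two statements look identical.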

The behaviour is easier to see with this:

d = {
        ('a',np.size),
        ('b','first')   
    }

dft2_Grp = dft2.groupby('channel')['state'].agg(d).reset_index() 
print(dft2_Grp)
  channel  a  b
0       1  3  A
1       2  1  A
2       3  2  A

For the state column, the new columns are created by different functions; first returns the first value of each group.

d = {
        'state': np.size,
        'rbc_security_type1':np.size
    }

dft2_Grp = dft2.groupby('channel').agg(d).reset_index() 
print(dft2_Grp)
  channel  state  rbc_security_type1
0       1      3                   3
1       2      1                   1
2       3      2                   2

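In the dictionary aggregation above (no tuples, the more common form), each key names a column and its value is the aggregate function for that column, so state gets np.size and so does rbc_security_type1. Compare this with tuples applied to the whole grouped DataFrame: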

d = {
        ('a',np.size),
        ('b',np.size)   
    }

dft2_Grp = dft2.groupby('channel').agg(d).reset_index() 
print(dft2_Grp)
  channel state    rbc_security_type1   
              b  a                  b  a
0       1     3  3                  3  3
1       2     1  1                  1  1
2       3     2  2                  2  2

This means every function in d is applied to every column (here np.size twice), and the result has a MultiIndex in the columns to distinguish the input columns.
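As a hedged sketch of the contrast, the block below applies a list of tuples to the whole grouped frame, then shows one way to get a flat, already renamed count per column with keyword-style named aggregation. Named aggregation is not part of the original answer; it assumes pandas 0.25 or later, and the string 'size' is again an assumption standing in for np.size. The names wide and counts are illustrative only.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    [['1', 'A', 'WW'], ['1', 'B', 'XX'], ['3', 'A', 'LL'],
     ['1', 'D', 'ZZ'], ['2', 'A', 'LL'], ['3', 'E', 'LL']],
    columns=['channel', 'state', 'rbc_security_type1'],
)

# Tuples on the whole grouped frame: every function runs on every remaining
# column, so the columns become a two-level MultiIndex (input column, label).
wide = df.groupby('channel').agg([('a', 'size'), ('b', 'size')])
print(wide.columns.tolist())

# Named aggregation: output_name=(input_column, function) yields one flat
# column per keyword, so no rename step is needed afterwards.
counts = df.groupby('channel').agg(
    State_Count=('state', 'size'),
    rbc_security_type1_Count=('rbc_security_type1', 'size'),
).reset_index()
print(counts)
```

With this form each output column is tied explicitly to one input column, which is the pairing the question expected the tuples to express.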

Pandas GroupBy with two columns, but counting on two fields / agg() output
