Question

I have the following Excel file:

ID     EmpName                   date           cost
1      bob smith              01/01/2019     10
2      Jane Doe               01/04/2019     20
3      steve ray, bob smith   01/03/2017     100

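I want to count how many times each person appears (Bob, Jane, Steve, ...), but in the row with ID 3 (and in other rows) the name field lists more than one employee, which is not ideal. What is the best way to handle this?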

I'm looking for something like this:

employee      count       cost
bob smith     2           110
jane doe      1           20
steve ray     1           100

Second question:

What if my data looks like this instead:

ID     EmpName1      Empname2    date           cost
1      bob smith                 01/01/2019     10
2      Jane Doe                  01/04/2019     20
3      steve ray     bob smith   01/03/2017     100

Can the counts be computed in a similar way?

Answer 1

Use get_dummies:

s=df.EmpName.str.get_dummies(', ')
pd.concat([s.sum(),s.mul(df.cost,0).sum()],axis=1)
Out[666]: 
           0    1
Jane Doe   1   20
bob smith  2  110
steve ray  1  100
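
For reference, here is a self-contained sketch of the get_dummies approach using the three rows from the question. The DataFrame is built by hand (in practice it would presumably come from pd.read_excel), and the labels count, cost and employee are my own choices to mirror the desired output, not something from the answer above.

```python
import pandas as pd

# Data copied from the question; the hand-built frame is an assumption,
# a real workflow would load it from the Excel file instead.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "EmpName": ["bob smith", "Jane Doe", "steve ray, bob smith"],
    "date": ["01/01/2019", "01/04/2019", "01/03/2017"],
    "cost": [10, 20, 100],
})

# One indicator column per distinct name: 1 where the row mentions that name
dummies = df["EmpName"].str.get_dummies(sep=", ")

summary = pd.DataFrame({
    # column sums of the indicators give the number of rows per employee
    "count": dummies.sum(),
    # multiplying row-wise by cost, then summing, gives the cost per employee
    "cost": dummies.mul(df["cost"], axis=0).sum(),
})
summary.index.name = "employee"
print(summary)
```

Note that the full cost of a shared row is attributed to every employee listed on it, which is what the desired output in the question implies (bob smith gets 10 + 100).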

Or we can use unnesting (a helper function defined at the end of this answer):

df.EmpName=df.EmpName.str.split(', ')
unnesting(df,['EmpName']).groupby('EmpName').cost.agg(['sum','count'])
Out[669]: 
           sum  count
EmpName              
Jane Doe    20      1
bob smith  110      2
steve ray  100      1

Update (for the second question):

s=df[['EmpName1','Empname2','cost']].melt(['cost']).groupby('value').cost.agg(['sum','count'])
s.drop('')
Out[678]: 
           sum  count
value                
Jane Doe    20      1
bob smith  110      2
steve ray  100      1
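
The s.drop('') step above assumes the empty cells are empty strings. If the file is read with pd.read_excel, blank cells usually arrive as NaN instead, so here is a minimal sketch of the same melt idea under that assumption. The hand-built frame and the column name employee are mine, for illustration only.

```python
import pandas as pd

# Second layout from the question; blank Empname2 cells are modeled as
# missing values (None), which is an assumption about how the file loads.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "EmpName1": ["bob smith", "Jane Doe", "steve ray"],
    "Empname2": [None, None, "bob smith"],
    "date": ["01/01/2019", "01/04/2019", "01/03/2017"],
    "cost": [10, 20, 100],
})

# Stack both name columns into one, keeping cost alongside each name
long_df = df.melt(
    id_vars=["cost"],
    value_vars=["EmpName1", "Empname2"],
    value_name="employee",
).dropna(subset=["employee"])  # discard the slots that had no second name

result = long_df.groupby("employee")["cost"].agg(["sum", "count"])
print(result)
```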

Or wide_to_long:

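# the stub 'EmpName' must match both columns, so 'Empname2' is renamed to 'EmpName2' first
pd.wide_to_long(df.rename(columns={'Empname2':'EmpName2'}),['EmpName'],i=['ID'],j='number').groupby('EmpName').cost.agg(['sum','count'])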

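import numpy as np
import pandas as pd

def unnesting(df, explode):
    # repeat each index label once per element of that row's list, then flatten the lists
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    # re-attach the remaining columns, aligned on the original index
    return df1.join(df.drop(explode, axis=1), how='left')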
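
On pandas 0.25 or newer, the built-in DataFrame.explode can stand in for the unnesting helper. This is a substitute technique rather than what the answer above uses; the sketch below also strips whitespace and lowercases the names so they match the desired output in the question, which is my own assumption about how names should be normalized.

```python
import pandas as pd

# Data copied from the question (date column omitted, it is not needed here)
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "EmpName": ["bob smith", "Jane Doe", "steve ray, bob smith"],
    "cost": [10, 20, 100],
})

# Split on commas, then give every name its own row; other columns are repeated
long_df = (
    df.assign(EmpName=df["EmpName"].str.split(","))
      .explode("EmpName")
      .reset_index(drop=True)
)

# Remove the whitespace left over from the split and normalize case
long_df["EmpName"] = long_df["EmpName"].str.strip().str.lower()

result = long_df.groupby("EmpName")["cost"].agg(["sum", "count"])
print(result)
```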

Answer 2

You may want to restructure your data so that it looks more like this:

ID     EmpName                   date           cost
1      bob smith              01/01/2019     10
2      Jane Doe               01/04/2019     20
3      steve ray              01/03/2017     100
3      bob smith              01/03/2017     100

From there, you can use groupby and sum to get what you need. Something like:

df.groupby(['EmpName'])[['cost']].sum()

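If you don't change this layout, it can turn into a nightmare in later stages of the analysis. The best convention is one record per row, which avoids errors down the line.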
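
The groupby above only returns the summed cost. Once the data has one record per row, a sketch along these lines would give both the count and the cost in one pass. It assumes pandas 0.25+ for named aggregation; the frame tidy, the output labels count and cost, and the lowercasing of names are my own illustrative choices.

```python
import pandas as pd

# One record per employee per row, as in the restructured table above
# (only the columns needed for the summary are included)
tidy = pd.DataFrame({
    "EmpName": ["bob smith", "Jane Doe", "steve ray", "bob smith"],
    "cost": [10, 20, 100, 100],
})

summary = (
    tidy.assign(employee=tidy["EmpName"].str.lower())  # treat 'Jane Doe' and 'jane doe' alike
        .groupby("employee")
        .agg(count=("cost", "size"), cost=("cost", "sum"))
)
print(summary)
```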

Counting employee occurrences in a pandas DataFrame when a column contains multiple values