Can anyone suggest why I can't sum over the last DataFrame?
Suggestions for a shorter way to split the tags and total the frequencies are also welcome.
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
print("\nCalculate aggregates for tags :\n")
TagsDFGroupBy = df.groupby(['Tags','Lab Location' ]).agg({'ADO ID': ['count']}).rename(columns={'ADO ID':'WorkItemCnt'}).reset_index()
print(TagsDFGroupBy)
This produces the following output:
|   | Tags | Labs | WorkItemCNT |
|---|------|------|-------------|
| 0 | A2040 | RXY Lab | 1 |
| 1 | AWAITING COMMODITY QUAL | RXY Lab | 1 |
| 2 | DNR | RXY Lab | 18 |
| 3 | DNR; MISSING SKU DOC | RXY Lab | 17 |
| 4 | MISSING QUAL INTAKE REQUEST; MISSING SKU DOC; NEED HARDWARE | QXR Lab | 1 |
| 5 | MISSING SKU DOC | RXY Lab | 2 |
| 6 | MISSING SKU DOC; NEED HARDWARE | RXY Lab | 1 |
| 7 | MISSING SKU DOC; NEED RA | RXY Lab | 1 |
| 8 | NEED HARDWARE | RXY Lab | 7 |
| 9 | NEED HARDWARE | VYZ Lab | 4 |
Then I run code to split the tags and sum the frequencies:
print("\nSplit tags by semicolon delimiter")
TagsDFGroupBy[['Tag1','Tag2','Tag3']] = TagsDFGroupBy.Tags.str.split(";",expand=True)
print("\nReplace none with blanks")
mask = TagsDFGroupBy.applymap(lambda x: x is None)
cols = TagsDFGroupBy.columns[(mask).any()]
for col in TagsDFGroupBy[cols]:
    TagsDFGroupBy.loc[mask[col], col] = ''
print("\n3 different dataframes")
TagsDFGroupBy1 = TagsDFGroupBy[['Lab Location','Tag1','WorkItemCnt']].rename(columns={'Tag1':'TagSplit'})
TagsDFGroupBy2 = TagsDFGroupBy[['Lab Location','Tag2','WorkItemCnt']].rename(columns={'Tag2':'TagSplit'})
TagsDFGroupBy3 = TagsDFGroupBy[['Lab Location','Tag3','WorkItemCnt']].rename(columns={'Tag3':'TagSplit'})
print("\nCombine 3 different dataframes into 1")
TagsConcat = pd.concat([TagsDFGroupBy1, TagsDFGroupBy2, TagsDFGroupBy3], ignore_index=True)
# Get names of indexes for which TagSplit has a blank value
indexNames = TagsConcat[TagsConcat['TagSplit'] == '' ].index
# Delete these row indexes from dataFrame
TagsConcat.drop(indexNames , inplace=True)
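# Note: reset_index() returns a new DataFrame; the result is discarded here, so the index stays unchanged
TagsConcat.reset_index()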
print('TagsConcat')
print(TagsConcat)
This produces the following output:
|    | Lab Location | TagSplit | WorkItemCnt |
|    |              |          | count       |
|----|--------------|----------|-------------|
| 0  | RXY LAB | A2040 | 1 |
| 1  | RXY LAB | AWAITING COMMODITY QUAL | 1 |
| 2  | RXY LAB | DNR | 18 |
| 3  | RXY LAB | DNR | 17 |
| 4  | QXR LAB | MISSING QUAL INTAKE REQUEST | 1 |
| 5  | RXY LAB | MISSING SKU DOC | 2 |
| 6  | RXY LAB | MISSING SKU DOC | 1 |
| 7  | RXY LAB | MISSING SKU DOC | 1 |
| 8  | RXY LAB | NEED HARDWARE | 7 |
| 9  | VYZ LAB | NEED HARDWARE | 4 |
| 13 | RXY LAB | MISSING SKU DOC | 17 |
| 14 | QXR LAB | MISSING SKU DOC | 1 |
| 16 | RXY LAB | NEED HARDWARE | 1 |
| 17 | RXY LAB | NEED RA | 1 |
| 24 | QXR LAB | NEED HARDWARE | 1 |
Finally, I try either of these:
TagsFinal = TagsConcat.groupby(['Lab Location', 'TagSplit'])['WorkItemCnt'].sum()
or
TagsFinal = TagsConcat.groupby(['Lab Location', 'TagSplit']).agg({'WorkItemCnt': ['sum']})
and I get this error:
KeyError: 'WorkItemCnt'
Answer 0 (score: 1)
I think your code can be simplified: first split the Tags column and expand it into rows with DataFrame.explode, then aggregate the counts with GroupBy.size:
TagsFinal = (df.assign(TagSplit = df['Tags'].str.split('; '))
               .explode('TagSplit')
               .groupby(['Labs', 'TagSplit'])
               .size()
               .reset_index(name='WorkItemCnt'))
print(TagsFinal)
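|   | Labs | TagSplit | WorkItemCnt |
|---|------|----------|-------------|
| 0 | QXR Lab | MISSING QUAL INTAKE REQUEST | 1 |
| 1 | QXR Lab | MISSING SKU DOC | 1 |
| 2 | QXR Lab | NEED HARDWARE | 1 |
| 3 | RXY Lab | A2040 | 1 |
| 4 | RXY Lab | AWAITING COMMODITY QUAL | 1 |
| 5 | RXY Lab | DNR | 2 |
| 6 | RXY Lab | MISSING SKU DOC | 4 |
| 7 | RXY Lab | NEED HARDWARE | 2 |
| 8 | RXY Lab | NEED RA | 1 |
| 9 | VYZ Lab | NEED HARDWARE | 1 |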
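On the KeyError itself, a likely cause (inferred from the extra count row under WorkItemCnt in the question's output, not confirmed in the thread) is that agg({'ADO ID': ['count']}) with a list of functions produces two-level column labels, so the column is no longer the plain label 'WorkItemCnt'. The sketch below shows this on made-up sample data, together with one possible way around it using named aggregation. It also sums the already aggregated counts instead of counting rows with size(), which may be closer to what was wanted when the starting point is the grouped table. The sample rows and the names nested, flat and tags_final are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's df (not from the thread)
df = pd.DataFrame({
    "ADO ID": [1, 2, 3, 4, 5],
    "Tags": ["DNR", "DNR; MISSING SKU DOC", "DNR; MISSING SKU DOC",
             "NEED HARDWARE", "NEED HARDWARE"],
    "Lab Location": ["RXY Lab", "RXY Lab", "RXY Lab", "RXY Lab", "VYZ Lab"],
})

# Passing a list of functions to agg gives two-level (MultiIndex) column labels
nested = (df.groupby(["Tags", "Lab Location"])
            .agg({"ADO ID": ["count"]})
            .rename(columns={"ADO ID": "WorkItemCnt"})
            .reset_index())
print(nested.columns.tolist())

# Named aggregation keeps a single level of column labels
flat = (df.groupby(["Tags", "Lab Location"])
          .agg(WorkItemCnt=("ADO ID", "count"))
          .reset_index())
print(flat.columns.tolist())

# Split the tags into rows, trim stray spaces, then sum the existing counts
tags_final = (flat.assign(TagSplit=flat["Tags"].str.split(";"))
                  .explode("TagSplit", ignore_index=True)
                  .assign(TagSplit=lambda d: d["TagSplit"].str.strip())
                  .groupby(["Lab Location", "TagSplit"], as_index=False)["WorkItemCnt"]
                  .sum())
print(tags_final)
```

With single-level column labels, the final groupby can select WorkItemCnt directly, and the three intermediate DataFrames and the concat step are no longer needed.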