I have a data frame like this:
ID,CUSTOMER_ID,ACC_NUMBER,TRANSACTION_ID,PACK_DESC,PACK_VALIDITY,PACK_NUMBER
1,ABCVRXJ,1027,1248,PackA,30,PACKA-XXXX
2,ABCVRXJ,1029,1249,PackC,32,PACKC-XXXX
3,XUVZ200,1028,12491,PackB,31,PACKB-XXXX
4,XUVZ200,1030,12421,PackD,33,PACKD-XXXX
I would like the final data frame to look like this:
ID,CUSTOMER_ID,ACC_NUMBER,TRANSACTION_ID,PACK_DESC,PACK_VALIDITY,PACK_NUMBER_1,PACK_NUMBER_2
1,ABCVRXJ,1027,1248,PackA,30,PACKA-XXXX,PACKC-XXXX
3,XUVZ200,1028,12491,PackB,31,PACKB-XXXX,PACKD-XXXX
Each CUSTOMER_ID that chose 2 packs should be collapsed into a single row, with the two PACK_NUMBER values in two separate columns.
I tried:
df['index'] = df.groupby('CUSTOMER_ID').cumcount()
df_vchrNumber = df.pivot(index='CUSTOMER_ID', columns='index', values='PACK_NUMBER').rename(columns=lambda x: 'PACK_NUMBER_'+str(x + 1))
df_vchrNumber = df_vchrNumber.fillna('').reset_index()
But this returns
CUSTOMER_ID,PACK_NUMBER_1,PACK_NUMBER_2
0123456789,PACKA-XXXX,PACKC-XXXX
9876543210,PACKB-XXXX,PACKD-XXXX
**But this is not the expected output, because I am not sure how to add the other columns.**
Could someone help me out?
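Answer 0 (score: 0)
Use groupby with agg to select the first row of each group. Then group again and take the last PACK_NUMBER of each group, and finally merge the two data frames to get the desired output: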
a = df.groupby('CUSTOMER_ID', as_index=False).agg('first')
b = df.groupby('CUSTOMER_ID', as_index=False).agg({'PACK_NUMBER':'last'})
df_final = a.merge(b, on='CUSTOMER_ID', suffixes=['_1', '_2'])
CUSTOMER_ID ID ACC_NUMBER TRANSACTION_ID PACK_DESC PACK_VALIDITY PACK_NUMBER_1 PACK_NUMBER_2
0 ABCVRXJ 1 1027 1248 PackA 30 PACKA-XXXX PACKC-XXXX
1 XUVZ200 3 1028 12491 PackB 31 PACKB-XXXX PACKD-XXXX
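One caveat worth illustrating: first/last only matches the requested output when every customer has exactly two packs. The sketch below uses hypothetical data (customer A with three packs, customer B with one; these IDs and pack values are made up for illustration) to show what the same approach produces in those cases.

```python
import pandas as pd

# Hypothetical data, not from the question: A has 3 packs, B has only 1.
df = pd.DataFrame({
    "CUSTOMER_ID": ["A", "A", "A", "B"],
    "PACK_NUMBER": ["P1", "P2", "P3", "P4"],
})

# Same first/last approach as above.
a = df.groupby("CUSTOMER_ID", as_index=False).agg("first")
b = df.groupby("CUSTOMER_ID", as_index=False).agg({"PACK_NUMBER": "last"})
out = a.merge(b, on="CUSTOMER_ID", suffixes=("_1", "_2"))

# A keeps P1 and P3 (P2 is dropped); B gets P4 in both columns.
print(out)
```

If customers can have a different number of packs, a cumcount-based pivot is probably the safer choice, since it creates one column per pack instead of assuming two.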
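Answer 1 (score: 0)
If you only need the first and last values of PACK_NUMBER, use DataFrame.drop_duplicates to keep the first row of each group, and drop_duplicates with keep='last' to get the last PACK_NUMBER of each group: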
s = (df.drop_duplicates('CUSTOMER_ID', keep='last')
.set_index('CUSTOMER_ID')['PACK_NUMBER']
.rename('PACK_NUMBER_2'))
df = (df.drop_duplicates('CUSTOMER_ID')
.rename(columns={'PACK_NUMBER':'PACK_NUMBER_1'})
.join(s, on='CUSTOMER_ID'))
print (df)
ID CUSTOMER_ID ACC_NUMBER TRANSACTION_ID PACK_DESC PACK_VALIDITY \
0 1 ABCVRXJ 1027 1248 PackA 30
2 3 XUVZ200 1028 12491 PackB 31
PACK_NUMBER_1 PACK_NUMBER_2
0 PACKA-XXXX PACKC-XXXX
2 PACKB-XXXX PACKD-XXXX
Your solution should be changed to drop the duplicate rows and then join the pivoted PACK_NUMBER columns:
df['index']= df.groupby('CUSTOMER_ID').cumcount()
df_vchrNumber = (df.pivot(index='CUSTOMER_ID', columns='index', values='PACK_NUMBER')
.rename(columns=lambda x: 'PACK_NUMBER_'+str(x + 1)))
df = df.drop_duplicates('CUSTOMER_ID').drop(columns='PACK_NUMBER').join(df_vchrNumber, on='CUSTOMER_ID')
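For reference, here is a self-contained sketch of the same idea that can be run end to end on the sample data from the question. It assumes a reasonably recent pandas and, unlike the snippet above, keeps the position counter in a temporary column named pos (a name chosen here for illustration) so that no helper column is left in the result.

```python
import io
import pandas as pd

# Sample data copied from the question.
csv = """ID,CUSTOMER_ID,ACC_NUMBER,TRANSACTION_ID,PACK_DESC,PACK_VALIDITY,PACK_NUMBER
1,ABCVRXJ,1027,1248,PackA,30,PACKA-XXXX
2,ABCVRXJ,1029,1249,PackC,32,PACKC-XXXX
3,XUVZ200,1028,12491,PackB,31,PACKB-XXXX
4,XUVZ200,1030,12421,PackD,33,PACKD-XXXX
"""
df = pd.read_csv(io.StringIO(csv))

# Number each customer's packs 0, 1, ... in order of appearance.
pos = df.groupby("CUSTOMER_ID").cumcount()

# One row per customer, one PACK_NUMBER_<n> column per pack.
packs = (
    df.assign(pos=pos)
      .pivot(index="CUSTOMER_ID", columns="pos", values="PACK_NUMBER")
      .rename(columns=lambda x: f"PACK_NUMBER_{x + 1}")
)

# Keep each customer's first row for the other columns, then attach the packs.
result = (
    df.drop_duplicates("CUSTOMER_ID")
      .drop(columns="PACK_NUMBER")
      .join(packs, on="CUSTOMER_ID")
      .reset_index(drop=True)
)
print(result)
```

The non-pack columns (ID, ACC_NUMBER and so on) come from the first row of each customer, which matches the expected output in the question.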
If you need to spread all columns:
df['index']= df.groupby('CUSTOMER_ID').cumcount() + 1
df = df.set_index(['CUSTOMER_ID', 'index']).unstack()
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
CUSTOMER_ID ID_1 ID_2 ACC_NUMBER_1 ACC_NUMBER_2 TRANSACTION_ID_1 \
0 ABCVRXJ 1 2 1027 1029 1248
1 XUVZ200 3 4 1028 1030 12491
TRANSACTION_ID_2 PACK_DESC_1 PACK_DESC_2 PACK_VALIDITY_1 PACK_VALIDITY_2 \
0 1249 PackA PackC 30 32
1 12421 PackB PackD 31 33
PACK_NUMBER_1 PACK_NUMBER_2
0 PACKA-XXXX PACKC-XXXX
1 PACKB-XXXX PACKD-XXXX
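A middle ground between the two variants is to spread only a chosen subset of columns and keep the first value of the rest. The sketch below is one way this might be done; the list wide_cols and the reduced sample frame are assumptions made for illustration, not part of the question.

```python
import pandas as pd

# Reduced version of the sample data, for brevity.
df = pd.DataFrame({
    "CUSTOMER_ID": ["ABCVRXJ", "ABCVRXJ", "XUVZ200", "XUVZ200"],
    "ACC_NUMBER": [1027, 1029, 1028, 1030],
    "PACK_DESC": ["PackA", "PackC", "PackB", "PackD"],
    "PACK_NUMBER": ["PACKA-XXXX", "PACKC-XXXX", "PACKB-XXXX", "PACKD-XXXX"],
})

# Hypothetical choice of columns to spread into *_1, *_2, ...
wide_cols = ["PACK_DESC", "PACK_NUMBER"]

# Position of each row within its customer, starting at 1.
pos = df.groupby("CUSTOMER_ID").cumcount() + 1

# Unstack only the chosen columns, then flatten the MultiIndex column labels.
wide = df.assign(pos=pos).set_index(["CUSTOMER_ID", "pos"])[wide_cols].unstack()
wide.columns = [f"{col}_{n}" for col, n in wide.columns]

# First row per customer for the remaining columns, plus the spread columns.
result = (
    df.drop_duplicates("CUSTOMER_ID")
      .drop(columns=wide_cols)
      .join(wide, on="CUSTOMER_ID")
      .reset_index(drop=True)
)
print(result)
```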