Convert repeated rows into separate columns

Time: 2019-05-23 09:33:26

Tags: python pandas

I have a DataFrame like the following:

ID,CUSTOMER_ID,ACC_NUMBER,TRANSACTION_ID,PACK_DESC,PACK_VALIDITY,PACK_NUMBER
1,ABCVRXJ,1027,1248,PackA,30,PACKA-XXXX
2,ABCVRXJ,1029,1249,PackC,32,PACKC-XXXX
3,XUVZ200,1028,12491,PackB,31,PACKB-XXXX
4,XUVZ200,1030,12421,PackD,33,PACKD-XXXX

I want the final DataFrame to look like this:

ID,CUSTOMER_ID,ACC_NUMBER,TRANSACTION_ID,PACK_DESC,PACK_VALIDITY,PACK_NUMBER_1,PACK_NUMBER_2
1,ABCVRXJ,1027,1248,PackA,30,PACKA-XXXX,PACKC-XXXX
3,XUVZ200,1028,12491,PackB,31,PACKB-XXXX,PACKD-XXXX

Each CUSTOMER_ID that has chosen 2 packs should be collapsed into a single row, with the two PACK_NUMBER values placed in 2 separate columns.

I tried:

df['index'] = df.groupby('CUSTOMER_ID').cumcount()
df_vchrNumber = df.pivot(index='CUSTOMER_ID', columns='index', values='PACK_NUMBER').rename(columns=lambda x: 'PACK_NUMBER_'+str(x + 1))
df_vchrNumber = df_vchrNumber.fillna('').reset_index()

But this returns

CUSTOMER_ID,PACK_NUMBER_1,PACK_NUMBER_2
0123456789,PACKA-XXXX,PACKC-XXXX
9876543210,PACKB-XXXX,PACKD-XXXX

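**But this is not the expected output, because I am not sure how to add the other columns.**

Would anyone be willing to help me out a little?

2 Answers:

Answer 0: (score: 0)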

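Use groupby together with agg to select the first row of each group. Then group again to get the last PACK_NUMBER of each group, and finally merge the two DataFrames to get the desired output: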

a = df.groupby('CUSTOMER_ID', as_index=False).agg('first')

b = df.groupby('CUSTOMER_ID', as_index=False).agg({'PACK_NUMBER':'last'})

df_final = a.merge(b, on='CUSTOMER_ID', suffixes=['_1', '_2'])


  CUSTOMER_ID  ID  ACC_NUMBER  TRANSACTION_ID PACK_DESC  PACK_VALIDITY PACK_NUMBER_1 PACK_NUMBER_2
0     ABCVRXJ   1        1027            1248     PackA             30    PACKA-XXXX    PACKC-XXXX
1     XUVZ200   3        1028           12491     PackB             31    PACKB-XXXX    PACKD-XXXX
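
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same first/last merge. It rebuilds the sample frame from the CSV text in the question; the variable names (csv_text, first_rows, last_pack) are only illustrative. Note that this approach keeps just the first and last PACK_NUMBER, so it fits the case where each customer has exactly two rows.

```python
import io

import pandas as pd

# Sample data copied from the question.
csv_text = """ID,CUSTOMER_ID,ACC_NUMBER,TRANSACTION_ID,PACK_DESC,PACK_VALIDITY,PACK_NUMBER
1,ABCVRXJ,1027,1248,PackA,30,PACKA-XXXX
2,ABCVRXJ,1029,1249,PackC,32,PACKC-XXXX
3,XUVZ200,1028,12491,PackB,31,PACKB-XXXX
4,XUVZ200,1030,12421,PackD,33,PACKD-XXXX
"""
df = pd.read_csv(io.StringIO(csv_text))

# First row of every customer: keeps all the descriptive columns.
first_rows = df.groupby('CUSTOMER_ID', as_index=False).agg('first')

# Last PACK_NUMBER of every customer.
last_pack = df.groupby('CUSTOMER_ID', as_index=False).agg({'PACK_NUMBER': 'last'})

# Both frames carry a PACK_NUMBER column, so the suffixes become _1 and _2.
df_final = first_rows.merge(last_pack, on='CUSTOMER_ID', suffixes=('_1', '_2'))
print(df_final)
```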

Answer 1: (score: 0)

If you only need the first and last PACK_NUMBER values, use DataFrame.drop_duplicates to keep the first row of each group, then join the last PACK_NUMBER value of each group:

s = (df.drop_duplicates('CUSTOMER_ID', keep='last')
       .set_index('CUSTOMER_ID')['PACK_NUMBER']
       .rename('PACK_NUMBER_2'))
df = (df.drop_duplicates('CUSTOMER_ID')
        .rename(columns={'PACK_NUMBER':'PACK_NUMBER_1'})
        .join(s, on='CUSTOMER_ID'))
print (df)
   ID CUSTOMER_ID  ACC_NUMBER  TRANSACTION_ID PACK_DESC  PACK_VALIDITY  \
0   1     ABCVRXJ        1027            1248     PackA             30   
2   3     XUVZ200        1028           12491     PackB             31   

  PACK_NUMBER_1 PACK_NUMBER_2  
0    PACKA-XXXX    PACKC-XXXX  
2    PACKB-XXXX    PACKD-XXXX 

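Your solution should be changed to drop the duplicate rows and then join the pivoted PACK_NUMBER columns:

df['index'] = df.groupby('CUSTOMER_ID').cumcount()
df_vchrNumber = (df.pivot(index='CUSTOMER_ID', columns='index', values='PACK_NUMBER')
                   .rename(columns=lambda x: 'PACK_NUMBER_'+str(x + 1)))

# drop(columns=...) replaces the positional axis argument, which newer pandas versions no longer accept
df = df.drop_duplicates('CUSTOMER_ID').drop(columns='PACK_NUMBER').join(df_vchrNumber, on='CUSTOMER_ID')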
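
The pivot-and-join idea above can be wrapped into a small helper, sketched here under a few assumptions. The function name spread_pack_numbers and the temporary column _pos are hypothetical, and the third pack for XUVZ200 (PACKE-XXXX) is made-up data that is not in the question. It is added only to show that, unlike the first/last approach, cumcount plus pivot keeps every pack and fills the missing slots with NaN.

```python
import pandas as pd


def spread_pack_numbers(df, key='CUSTOMER_ID', value='PACK_NUMBER'):
    """Hypothetical helper: one row per key, one numbered column per value."""
    # Position of each row inside its group, starting at 1.
    pos = df.groupby(key).cumcount() + 1
    wide = (df.assign(_pos=pos)
              .pivot(index=key, columns='_pos', values=value)
              .rename(columns=lambda n: f'{value}_{n}'))
    # Keep the first row of each group for the remaining columns.
    first = df.drop_duplicates(key).drop(columns=value)
    return first.join(wide, on=key).reset_index(drop=True)


# Illustrative data: the fifth row (PACKE-XXXX) is invented for this example.
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'CUSTOMER_ID': ['ABCVRXJ', 'ABCVRXJ', 'XUVZ200', 'XUVZ200', 'XUVZ200'],
    'PACK_NUMBER': ['PACKA-XXXX', 'PACKC-XXXX', 'PACKB-XXXX',
                    'PACKD-XXXX', 'PACKE-XXXX'],
})

result = spread_pack_numbers(df)
print(result)
```

Using assign instead of writing df['index'] leaves the input frame unmodified, so no helper column has to be dropped afterwards.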

If you need to process all columns:

df['index'] = df.groupby('CUSTOMER_ID').cumcount() + 1
df = df.set_index(['CUSTOMER_ID', 'index']).unstack()
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
  CUSTOMER_ID  ID_1  ID_2  ACC_NUMBER_1  ACC_NUMBER_2  TRANSACTION_ID_1  \
0     ABCVRXJ     1     2          1027          1029              1248   
1     XUVZ200     3     4          1028          1030             12491   

   TRANSACTION_ID_2 PACK_DESC_1 PACK_DESC_2  PACK_VALIDITY_1  PACK_VALIDITY_2  \
0              1249       PackA       PackC               30               32   
1             12421       PackB       PackD               31               33   

  PACK_NUMBER_1 PACK_NUMBER_2  
0    PACKA-XXXX    PACKC-XXXX  
1    PACKB-XXXX    PACKD-XXXX