How to append strings from one row to another based on a duplicate value in another column

Asked: 2019-12-11 17:33:47

Tags: python pandas duplicates data-cleaning

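I have a dataset with many duplicate values in the "Email" field, which I want to use as a unique ID. However, each duplicate row carries unique information about the user's "tags", and I want to compile and keep that information before dropping the duplicates.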

Example:

import pandas as pd
import numpy as np
df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan, np.nan],
               [1234, 'Customer A', np.nan, '333 Street', np.nan],
               [1234, 'Customer A', '12345 Street', np.nan, np.nan],
               [1234, 'Customer A', np.nan, np.nan, np.nan],
               [1233, 'Customer B', '444 Street', '3335 Street', np.nan],
               [1233, 'Customer B', '555 Street', '666 Street', np.nan],
               [1233, 'Customer B', '553 Street', '666 Street', 'abc@email.com'],
               [1235, 'Customer C', '1553 Street', '644 Street', 'abc@email.com'],
               [1235, 'Customer C', '2553 Street', '644 Street', 'abc@email.com']],     
               columns=['ID', 'Customer', 'Billing Address', 'Shipping Address', 'Contact'])
df.head()
    ID      Customer    Billing Address Shipping Address     Contact
0   1234    Customer A  123 Street      NaN                  NaN
1   1234    Customer A  NaN             333 Street           NaN
2   1234    Customer A  12345 Street    NaN                  NaN
3   1234    Customer A  NaN             NaN                  NaN
4   1233    Customer B  444 Street      3335 Street          NaN

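I want to merge the Contact information from every row labelled "Customer A" into the last row, separated by commas. For Customer A the final result would be NaN, NaN, NaN, NaN (or whatever other string data is in each field), simply combined and separated by commas.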

Here is what I have tried, but there must be a more elegant solution. After sorting by the Email field:

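def row_clean(df):
    # Walk adjacent rows; when two neighbours share a Customer,
    # carry the accumulated Contact text forward into the later row.
    for i in range(0, len(df)-1):
        # NaN never compares equal with ==, so test for it with pd.isna
        if pd.isna(df.loc[i,'Customer']):
            return df
        elif df.loc[i,'Customer'] == df.loc[(i+1),'Customer']:
            df.loc[(i+1),'Contact'] = str(df.loc[(i+1),'Contact']) + ', ' + str(df.loc[i,'Contact'])
    return df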

row_clean(df)

Any ideas here? Thanks!

1 Answer:

Answer 0: (score: 0)

Is this what you want?

customers=df["Customer"].unique().tolist()
List=[]

for customer in customers: 
    List.append(df.loc[df["Customer"]==customer,"Contact"].tolist())

df=df.drop_duplicates("Customer",keep="first")
df["new"]=List

Output

Out[10]: 
     ID    Customer  ...        Contact                             new
0  1234  Customer A  ...            NaN            [nan, nan, nan, nan]
4  1233  Customer B  ...            NaN       [nan, nan, abc@email.com]
7  1235  Customer C  ...  abc@email.com  [abc@email.com, abc@email.com]

[3 rows x 6 columns]
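
If a comma-separated string is wanted rather than a list, one possible alternative is to group by Customer and join the non-missing Contact values. This is only a sketch, not part of the answer above. It assumes a trimmed copy of the question's frame (ID, Customer and Contact only), that NaN entries should be skipped, and that repeated contacts should appear once. The helper name join_contacts is made up for illustration.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the question's data: only the columns needed here.
df = pd.DataFrame(
    [[1234, 'Customer A', np.nan],
     [1234, 'Customer A', np.nan],
     [1233, 'Customer B', np.nan],
     [1233, 'Customer B', 'abc@email.com'],
     [1235, 'Customer C', 'abc@email.com'],
     [1235, 'Customer C', 'abc@email.com']],
    columns=['ID', 'Customer', 'Contact'])

def join_contacts(contacts):
    # Hypothetical helper: drop NaN, keep each distinct value once
    # (an assumption; remove .unique() to keep repeats), join with ', '.
    values = contacts.dropna().astype(str).unique()
    return ', '.join(values) if len(values) else np.nan

# One row per customer, in order of first appearance.
merged = (df.groupby('Customer', sort=False)['Contact']
            .agg(join_contacts)
            .reset_index())

# Keep the last row of each customer and attach the merged contacts.
result = (df.drop(columns='Contact')
            .drop_duplicates('Customer', keep='last')
            .merge(merged, on='Customer'))
```

Here result has one row per customer, with Contact holding the joined string, or NaN when the customer had no contact at all. Grouping avoids the row-by-row loop and does not depend on the frame being sorted first.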