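I have a dataset that contains multiple duplicates of the "email" field, which I want to use as a unique ID. However, each duplicate holds unique information about the user's "tags" that I would like to compile and preserve before dropping the duplicates.
Example: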
import pandas as pd
import numpy as np
df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan, np.nan],
[1234, 'Customer A', np.nan, '333 Street', np.nan],
[1234, 'Customer A', '12345 Street', np.nan, np.nan],
[1234, 'Customer A', np.nan, np.nan, np.nan],
[1233, 'Customer B', '444 Street', '3335 Street', np.nan],
[1233, 'Customer B', '555 Street', '666 Street', np.nan],
[1233, 'Customer B', '553 Street', '666 Street', 'abc@email.com'],
[1235, 'Customer C', '1553 Street', '644 Street', 'abc@email.com'],
[1235, 'Customer C', '2553 Street', '644 Street', 'abc@email.com']],
columns=['ID', 'Customer', 'Billing Address', 'Shipping Address', 'Contact'])
df.head()
     ID    Customer  Billing Address  Shipping Address  Contact
0  1234  Customer A       123 Street               NaN      NaN
1  1234  Customer A              NaN        333 Street      NaN
2  1234  Customer A     12345 Street               NaN      NaN
3  1234  Customer A              NaN               NaN      NaN
4  1233  Customer B       444 Street       3335 Street      NaN
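I want to merge the Contact information from every row labeled "Customer A" into the last row, separated by ", ". The end result for that field would be NaN, NaN, NaN, NaN (or whatever other string data is in each field), simply combined and separated by a comma.
Here is what I have tried, but there must be a more elegant solution.
After sorting by the Email field: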
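def row_clean(df):
    # Assumes a default 0..n-1 integer index (reset it after sorting).
    for i in range(0, len(df) - 1):
        # `x == np.NaN` is always False, so test for missing values with pd.isna.
        if pd.isna(df.loc[i, 'Customer']):
            return df
        elif df.loc[i, 'Customer'] == df.loc[i + 1, 'Customer']:
            # Carry the accumulated Contact values forward into the next row.
            df.loc[i + 1, 'Contact'] = str(df.loc[i + 1, 'Contact']) + ', ' + str(df.loc[i, 'Contact'])
    return df

row_clean(df)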
Any ideas here? Thanks!
Answer 0: (score: 0)
Is this what you want?
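# Unique customers, in order of first appearance.
customers = df["Customer"].unique().tolist()
List = []
for customer in customers:
    # Collect every Contact value recorded for this customer.
    List.append(df.loc[df["Customer"] == customer, "Contact"].tolist())
# Keep the first row per customer; the order matches `customers`.
df = df.drop_duplicates("Customer", keep="first")
df["new"] = List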
Output:
Out[10]:
     ID    Customer  ...        Contact                             new
0  1234  Customer A  ...            NaN            [nan, nan, nan, nan]
4  1233  Customer B  ...            NaN       [nan, nan, abc@email.com]
7  1235  Customer C  ...  abc@email.com  [abc@email.com, abc@email.com]

[3 rows x 6 columns]
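If you would rather get comma-separated strings than Python lists, and have every field merged rather than only Contact, a groupby may be a shorter route. This is only a sketch under some assumptions: that ID and Customer together identify a customer, and that missing values should appear as the literal text "NaN", as in the question. The names join_values, value_cols and merged are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1234, 'Customer A', '123 Street', np.nan, np.nan],
     [1234, 'Customer A', np.nan, '333 Street', np.nan],
     [1234, 'Customer A', '12345 Street', np.nan, np.nan],
     [1234, 'Customer A', np.nan, np.nan, np.nan],
     [1233, 'Customer B', '444 Street', '3335 Street', np.nan],
     [1233, 'Customer B', '555 Street', '666 Street', np.nan],
     [1233, 'Customer B', '553 Street', '666 Street', 'abc@email.com'],
     [1235, 'Customer C', '1553 Street', '644 Street', 'abc@email.com'],
     [1235, 'Customer C', '2553 Street', '644 Street', 'abc@email.com']],
    columns=['ID', 'Customer', 'Billing Address', 'Shipping Address', 'Contact'])


def join_values(series):
    """Join one group's values with ', ', writing missing values as 'NaN'."""
    return ', '.join('NaN' if pd.isna(v) else str(v) for v in series)


# Hypothetical choice: every column except the two keys gets merged.
value_cols = ['Billing Address', 'Shipping Address', 'Contact']

merged = (df.groupby(['ID', 'Customer'], sort=False)      # keep first-appearance order
            .agg({col: join_values for col in value_cols})  # one string per column per group
            .reset_index())

print(merged[['Customer', 'Contact']])
```

This leaves one row per customer, with the Contact of "Customer A" reading NaN, NaN, NaN, NaN. If you would rather skip missing values than print them, you could filter them out inside join_values.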