I have a dataframe with repeated IDs, where the data is only partially filled in across several fields.
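I want to keep all of the data, creating new columns wherever data exists, so that the result looks like the dataframe below:
I tried the following, but it dropped data that I want to keep.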
import numpy as np
import pandas as pd

df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan, np.nan],
                   [1234, 'Customer A', np.nan, '333 Street', np.nan],
                   [1234, 'Customer A', '12345 Street', np.nan, np.nan],
                   [1234, 'Customer A', np.nan, np.nan, np.nan],
                   [1233, 'Customer B', '444 Street', '3335 Street', np.nan],
                   [1233, 'Customer B', '555 Street', '666 Street', np.nan],
                   [1233, 'Customer B', '553 Street', '666 Street', 'abc@email.com'],
                   [1235, 'Customer C', '1553 Street', '644 Street', 'abc@email.com'],
                   [1235, 'Customer C', '2553 Street', '644 Street', 'abc@email.com']],
                  columns=['ID', 'Customer', 'Billing Address', 'Shipping Address', 'Contact'])
df
     ID    Customer  Billing Address  Shipping Address        Contact
0  1234  Customer A       123 Street               NaN            NaN
1  1234  Customer A              NaN        333 Street            NaN
2  1234  Customer A     12345 Street               NaN            NaN
3  1234  Customer A              NaN               NaN            NaN
4  1233  Customer B       444 Street       3335 Street            NaN
5  1233  Customer B       555 Street        666 Street            NaN
6  1233  Customer B       553 Street        666 Street  abc@email.com
7  1235  Customer C      1553 Street        644 Street  abc@email.com
8  1235  Customer C      2553 Street        644 Street  abc@email.com
Edit: I added more data, because the original post did not make it clear that an ID can have multiple rows.
Answer 0: (score: 3)
Here is one way, using apply and building a pd.Series from a dict to create the new columns:
In [1057]: cols = ['Billing Address', 'Shipping Address']
In [1058]: (df.groupby(['ID', 'Customer'])
      ...:    .apply(lambda g: pd.Series({'%s %s' % (x, i+1): v[x]
      ...:                                for i, v in enumerate(g[cols].to_dict('records'))
      ...:                                for x in v})))
Out[1058]:
                Billing Address 1 Billing Address 2 Shipping Address 1  \
ID   Customer
1233 Customer B        444 Street        555 Street         333 Street
1234 Customer A        123 Street               NaN                NaN

                Shipping Address 2
ID   Customer
1233 Customer B         666 Street
1234 Customer A         333 Street
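The same dictionary-per-group idea can also be written as an explicit loop, which makes the key-building step easier to see. This is only a sketch, not the answer's own code: it assumes the edited sample data from the question (where IDs have different numbers of rows), spells the orient as 'records' because recent pandas versions no longer accept the 'r' shorthand, and the names records, row and wide are purely illustrative.

```python
import numpy as np
import pandas as pd

# Edited sample data from the question.
df = pd.DataFrame(
    [[1234, 'Customer A', '123 Street', np.nan, np.nan],
     [1234, 'Customer A', np.nan, '333 Street', np.nan],
     [1234, 'Customer A', '12345 Street', np.nan, np.nan],
     [1234, 'Customer A', np.nan, np.nan, np.nan],
     [1233, 'Customer B', '444 Street', '3335 Street', np.nan],
     [1233, 'Customer B', '555 Street', '666 Street', np.nan],
     [1233, 'Customer B', '553 Street', '666 Street', 'abc@email.com'],
     [1235, 'Customer C', '1553 Street', '644 Street', 'abc@email.com'],
     [1235, 'Customer C', '2553 Street', '644 Street', 'abc@email.com']],
    columns=['ID', 'Customer', 'Billing Address', 'Shipping Address', 'Contact'])

cols = ['Billing Address', 'Shipping Address']

records = []  # one flat dict per (ID, Customer) group
for (cust_id, customer), group in df.groupby(['ID', 'Customer']):
    row = {'ID': cust_id, 'Customer': customer}
    # Number the group's rows 1, 2, 3, ... and build keys like 'Billing Address 2'.
    for i, rec in enumerate(group[cols].to_dict('records'), start=1):
        for col, value in rec.items():
            row[f'{col} {i}'] = value
    records.append(row)

# Groups with fewer rows simply get NaN in the columns they lack.
wide = pd.DataFrame(records).set_index(['ID', 'Customer'])
print(wide)
```

Building plain dicts and handing them to the DataFrame constructor in one go means groups of different sizes are padded with NaN by the constructor itself.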
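Answer 1: (score: 1)
Here is a potential solution, although it is not at all efficient in terms of the memory it uses along the way.
The idea is to loop as many times as the largest number of rows any single ID has, merging the dataframe with each ID's nth row on every pass: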
# The first row of each ID is the base frame; the remaining rows are merged in one pass at a time.
new_df = df.drop_duplicates(subset=['ID'])
temp_df = df.drop(new_df.index)
nth_address = 1
while len(temp_df) > 0:
    # Next remaining row for each ID
    temp = temp_df.drop_duplicates(subset=['ID'])
    new_df = new_df.merge(temp, suffixes=('_' + str(nth_address), '_' + str(nth_address + 1)),
                          on='ID', how='left')
    temp_df = temp_df.drop(temp.index)
    nth_address += 1
     ID  Customer_1  Billing Address_1  Shipping Address_1  Customer_2  Billing Address_2  Shipping Address_2
0  1234  Customer A         123 Street                 NaN  Customer A                NaN          333 Street
1  1233  Customer B         444 Street          333 Street  Customer B         555 Street          666 Street
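The Customer_1 and Customer_2 columns in this output come from how merge applies suffixes: they are added only to column names that appear in both frames and are not join keys. The sketch below uses hypothetical one-row frames (first, second and third are made-up names and values, not part of the answer) to show that rule, and also that a column is left unsuffixed when nothing clashes, which is worth keeping in mind once an ID has more than two rows.

```python
import pandas as pd

# Hypothetical one-row frames standing in for an ID's 1st, 2nd and 3rd rows.
first = pd.DataFrame({'ID': [1234], 'Customer': ['Customer A'],
                      'Billing Address': ['123 Street']})
second = pd.DataFrame({'ID': [1234], 'Customer': ['Customer A'],
                       'Billing Address': ['555 Street']})
third = pd.DataFrame({'ID': [1234], 'Customer': ['Customer A'],
                      'Billing Address': ['553 Street']})

# 'Customer' and 'Billing Address' exist on both sides, so both get suffixed.
two = first.merge(second, on='ID', how='left', suffixes=('_1', '_2'))
print(list(two.columns))
# ['ID', 'Customer_1', 'Billing Address_1', 'Customer_2', 'Billing Address_2']

# Now nothing clashes ('Customer_1' != 'Customer'), so the new columns keep
# their plain names and the suffixes are not used at all.
three = two.merge(third, on='ID', how='left', suffixes=('_2', '_3'))
print(list(three.columns))
# ['ID', 'Customer_1', 'Billing Address_1', 'Customer_2', 'Billing Address_2',
#  'Customer', 'Billing Address']
```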
To match your desired output, we need to merge on ['ID','Customer'] instead, since both columns act as the same key in your example:
# Same loop as before, but 'Customer' is part of the join key, so it is not duplicated.
new_df = df.drop_duplicates(subset=['ID'])
temp_df = df.drop(new_df.index)
nth_address = 1
while len(temp_df) > 0:
    temp = temp_df.drop_duplicates(subset=['ID'])
    new_df = new_df.merge(temp, suffixes=('_' + str(nth_address), '_' + str(nth_address + 1)),
                          on=['ID', 'Customer'], how='left')
    temp_df = temp_df.drop(temp.index)
    nth_address += 1
     ID    Customer  Billing Address_1  Shipping Address_1  Billing Address_2  Shipping Address_2
0  1234  Customer A         123 Street                 NaN                NaN          333 Street
1  1233  Customer B         444 Street          333 Street         555 Street          666 Street
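Since the loop above is described as memory-hungry, one possible alternative is to number each ID's rows with groupby().cumcount() and pivot that number into the column names with unstack(). This is a different technique from the answer's merge loop, shown only as a sketch: it assumes the edited sample data from the question, and the helper column n and the name wide are hypothetical.

```python
import numpy as np
import pandas as pd

# Edited sample data from the question.
df = pd.DataFrame(
    [[1234, 'Customer A', '123 Street', np.nan, np.nan],
     [1234, 'Customer A', np.nan, '333 Street', np.nan],
     [1234, 'Customer A', '12345 Street', np.nan, np.nan],
     [1234, 'Customer A', np.nan, np.nan, np.nan],
     [1233, 'Customer B', '444 Street', '3335 Street', np.nan],
     [1233, 'Customer B', '555 Street', '666 Street', np.nan],
     [1233, 'Customer B', '553 Street', '666 Street', 'abc@email.com'],
     [1235, 'Customer C', '1553 Street', '644 Street', 'abc@email.com'],
     [1235, 'Customer C', '2553 Street', '644 Street', 'abc@email.com']],
    columns=['ID', 'Customer', 'Billing Address', 'Shipping Address', 'Contact'])

cols = ['Billing Address', 'Shipping Address', 'Contact']

# n = 1, 2, 3, ... within each ID, in the original row order.
numbered = df.assign(n=df.groupby('ID').cumcount() + 1)

# Move n from the index into the columns: one (column, n) pair per output column.
wide = numbered.set_index(['ID', 'Customer', 'n'])[cols].unstack('n')

# Flatten ('Billing Address', 2) into 'Billing Address_2'.
wide.columns = [f'{col}_{n}' for col, n in wide.columns]

# Drop columns that are empty for every ID, then restore ID/Customer as columns.
wide = wide.dropna(axis=1, how='all').reset_index()
print(wide)
```

Every value column is numbered consistently here, however many rows an ID has, because the number comes from cumcount() rather than from merge suffixes.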