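I have a problem. I want to build a new DataFrame from another one while avoiding duplicate rows: if a row with the same email already exists, the new row should be concatenated side by side with it; otherwise it should be appended below. The problem is that I get an index error every time.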
pandas.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Here is what I did:
if not self.data.empty:
    if data_frame_['Email'][0] in self.data['Email'].get_values():
        self.data = pd.concat([self.data, data_frame_], axis=1)
    else:
        self.data = pd.concat([self.data, data_frame_], axis=0)
else:
    self.data = data_frame_.copy()
end = time.time()
data_frame_ has only one row, which is why I use data_frame_['Email'][0].
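A likely cause of the error, sketched under assumptions: every single-row data_frame_ carries index label 0, so stacking them with axis=0 leaves duplicate labels, and a later axis=1 concat has to align on those labels. The frames row_a, row_b and row_c below are hypothetical stand-ins for successive data_frame_ values; the exact exception type may vary between pandas versions, so the sketch only records it.

```python
import pandas as pd

# Hypothetical single-row frames standing in for successive data_frame_ values
row_a = pd.DataFrame({"Email": ["kml@mail.com"], "Project1": [1], "Target1": [5000]})
row_b = pd.DataFrame({"Email": ["abc@abc.com"], "Project1": [7], "Target1": [5000]})
row_c = pd.DataFrame({"Email": ["kml@mail.com"], "Project1": [7], "Target1": [4000]})

# Each single-row frame has index label 0, so stacking keeps two rows labelled 0
stacked = pd.concat([row_a, row_b], axis=0)
has_unique_index = stacked.index.is_unique

# axis=1 aligns rows by index label; the duplicated labels are the suspected
# trigger for "Reindexing only valid with uniquely valued Index objects"
try:
    pd.concat([stacked, row_c], axis=1)
    side_by_side_error = None
except Exception as exc:
    side_by_side_error = exc
print(has_unique_index, type(side_by_side_error).__name__)

# ignore_index=True renumbers the rows, which keeps the labels unique
renumbered = pd.concat([row_a, row_b], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```

Even with unique labels, axis=1 pairs rows by index label rather than by Email, so the new row would not necessarily end up next to the row with the matching address.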
Sample data (in data_frame_):
Email          Project1  Target1  Project2  Target2
---------------------------------------------------
kml@mail.com   1         5000     NaN       NaN
abc@abc.com    7         5000     NaN       NaN
kml@mail.com   7         4000     NaN       NaN
What I want is:
Email          Project1  Target1  Project2  Target2
---------------------------------------------------
kml@mail.com   1         5000     7         4000
abc@abc.com    7         5000     NaN       NaN
P.S.: I could do this with dicts, but to keep the code consistent I want to use DataFrames.
Thanks in advance.
Answer 0 (score: 1)
You can use pivot_table, but first create groups with cumcount:
#rename columns
df.rename(columns={'Project1':'Project','Target1':'Target'}, inplace=True)
print (df)
          Email  Project  Target
0  kml@mail.com        1    5000
1   abc@abc.com        7    5000
2  kml@mail.com        7    4000
df['g'] = (df.groupby('Email').cumcount() + 1).astype(str)
df1 = df.pivot_table(index='Email', columns='g', values=['Project', 'Target'])
#Sort multiindex in columns
df1 = df1.sort_index(axis=1, level=1)
#'reset' multiindex in columns
df1.columns = [''.join(col) for col in df1.columns]
print (df1)
              Project1  Target1  Project2  Target2
Email
abc@abc.com        7.0   5000.0       NaN      NaN
kml@mail.com       1.0   5000.0       7.0   4000.0
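For the incremental case in the question, one option is to collect the single-row frames in a list and reshape once at the end. This is only a sketch: the helper name widen_by_email and the frames list are hypothetical, and it assumes the columns have already been renamed to Project and Target as above.

```python
import pandas as pd

def widen_by_email(frames):
    """Stack the frames, then spread repeated emails across numbered columns."""
    # ignore_index=True gives the stacked rows unique index labels
    df = pd.concat(frames, ignore_index=True)
    # Number each row within its Email group: "1" for the first, "2" for the second, ...
    df["g"] = (df.groupby("Email").cumcount() + 1).astype(str)
    wide = df.pivot_table(index="Email", columns="g", values=["Project", "Target"])
    # Order columns by occurrence number, then flatten ('Project', '1') to 'Project1'
    wide = wide.sort_index(axis=1, level=1)
    wide.columns = ["".join(col) for col in wide.columns]
    return wide.reset_index()

# Hypothetical single-row inputs, one per incoming record
frames = [
    pd.DataFrame({"Email": ["kml@mail.com"], "Project": [1], "Target": [5000]}),
    pd.DataFrame({"Email": ["abc@abc.com"], "Project": [7], "Target": [5000]}),
    pd.DataFrame({"Email": ["kml@mail.com"], "Project": [7], "Target": [4000]}),
]
result = widen_by_email(frames)
print(result)
```

Because g is a string, column order is lexicographic, so an email with ten or more rows would need a numeric sort instead. pivot_table also aggregates with the mean by default, which is harmless here because each (Email, g) pair occurs once.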