I have a df with duplicate IDs, like this:
index ID name surname
1 1 a x
2 2 b y
3 1 c z
4 3 d j
I want to add the columns of the duplicated rows on the right and drop the "single" rows, like this:
index ID name surname second_name second_surname
1 1 a x c z
What is the most efficient way to do this? (I have several million rows.)
Answer 0 (score: 1)
Try using drop_duplicates, merge, and query like this:
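# Assumes df already has a column literally named 'index', so reset_index()
# stores the old row labels in 'level_0'; also assumes at most two rows per ID.
df['second_name'] = (df.drop_duplicates(subset='ID')
                       .reset_index()
                       .merge(df, on='ID', how='inner', suffixes=('', '_'))
                       .query("name != name_")
                       .set_index('level_0')['name_'])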
[out]
index ID name second_name
0 1 1 a c
1 2 2 b NaN
2 3 1 c NaN
3 4 3 d NaN
If you only need the rows that have a duplicate, use dropna:
df.dropna(subset=['second_name'])
[out]
index ID name second_name
0 1 1 a c
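The question also asks for second_surname, which the snippet above does not produce. As a minimal sketch (not the answerer's code), one could split the frame with duplicated and merge the two halves. The sample frame and the variable names is_repeat, first, repeats, and result are assumptions made for illustration, and the sketch assumes each ID occurs at most twice; an ID with three or more rows would produce several output rows.

```python
import pandas as pd

# Sample data rebuilt from the question (the extra 'index' column is omitted).
df = pd.DataFrame({
    "ID": [1, 2, 1, 3],
    "name": ["a", "b", "c", "d"],
    "surname": ["x", "y", "z", "j"],
})

# True for every row whose ID has already appeared earlier in the frame.
is_repeat = df.duplicated(subset="ID", keep="first")

first = df[~is_repeat]
repeats = df[is_repeat].rename(
    columns={"name": "second_name", "surname": "second_surname"}
)

# The inner merge keeps only IDs that have a repeated row,
# which drops the "single" rows in the same step.
result = first.merge(repeats, on="ID", how="inner")
print(result)
```

Because duplicated and merge are both vectorized, this avoids a Python-level loop, though it has not been benchmarked here on millions of rows.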
Answer 1 (score: 1)
My suggestion uses groupby and should work for any number of "additional" names:
import pandas as pd
df_in = pd.DataFrame({'ID': [1, 2, 1, 3], 'name': ['a', 'b', 'c', 'd']})
grp = df_in.groupby('ID', as_index=True)
df_a = grp.first()  # first row of each ID
# One column per distinct name within an ID; name_1 repeats df_a['name'], so drop it.
df_b = (grp['name'].unique().apply(pd.Series)
        .rename(columns=lambda x: 'name_{:.0f}'.format(x + 1))
        .drop('name_1', axis=1))
df_out = df_a.merge(df_b, how='inner', left_index=True, right_index=True).reset_index(drop=False)
Answer 2 (score: 1)
I would try pivoting the dataframe. To do that, I first add a rank column that numbers each name within its ID:
df['rank'] = df.groupby('ID').cumcount()
pivoted = df.pivot(index='ID', columns='rank', values='name')
This gives:
rank 0 1
ID
1 a c
2 b NaN
3 d NaN
Let's format it:
pivoted = pivoted.rename_axis(None, axis=1).rename(lambda x: 'name_{}'.format(x),
axis=1).reset_index()
ID name_0 name_1
0 1 a c
1 2 b NaN
2 3 d NaN
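The same pivot idea can probably be extended to carry surname as well and to drop the IDs that occur only once, as the question requests. The following is a sketch under assumptions, not part of the original answer: the sample frame is rebuilt from the question, and the names dup and wide are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "ID": [1, 2, 1, 3],
    "name": ["a", "b", "c", "d"],
    "surname": ["x", "y", "z", "j"],
})

# Occurrence number of each row within its ID (0 for the first, 1 for the second, ...).
df["rank"] = df.groupby("ID").cumcount()

# keep=False marks every row of a repeated ID, so the "single" rows are filtered out.
dup = df[df.duplicated(subset="ID", keep=False)]

# Passing a list to `values` yields MultiIndex columns such as ('name', 0).
wide = dup.pivot(index="ID", columns="rank", values=["name", "surname"])

# Flatten ('name', 0) -> 'name_0', and so on.
wide.columns = [f"{col}_{rank}" for col, rank in wide.columns]
wide = wide.reset_index()
print(wide)
```

The columns come out grouped by field (name_0, name_1, surname_0, surname_1); reorder or rename them afterwards if the layout from the question is needed.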
Answer 3 (score: 0)
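import numpy as np
import pandas as pd

# r: sorted unique IDs; i: position of each row's ID within r
r, i = np.unique(df.ID, return_inverse=True)
# j: occurrence number of each row within its ID (0, 1, ...)
j = df.groupby('ID').cumcount()
names = np.empty((len(r), j.max() + 1), object)
names.fill(np.nan)
names[i, j] = df.name
pd.DataFrame(names, r).rename_axis('ID').add_prefix('name_')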
name_0 name_1
ID
1 a c
2 b NaN
3 d NaN
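# Alternative without NumPy: build the columns with defaultdict and itertools.count
from itertools import count
from collections import defaultdict
c = defaultdict(count)  # one running counter per ID
d = defaultdict(dict)   # column name -> {ID: name}
for i, n in zip(df.ID, df.name):
    d[f'name_{next(c[i])}'][i] = n
pd.DataFrame(d).rename_axis('ID')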
name_0 name_1
ID
1 a c
2 b NaN
3 d NaN