I have a large DataFrame in the following format:
term_x Intersections term_y
boxers 1 briefs
briefs 1 boxers
babies 6 costumes
costumes 6 babies
babies 12 clothes
clothes 12 babies
babies 1 clothings
clothings 1 babies
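The file has millions of rows. I want to remove these redundant rows. Is there a fast, Pythonic way to do this with pandas' duplicate-removal functionality? My current approach iterates over the whole DataFrame, compares each row with its neighbor, and drops the duplicate row, but it turns out to be very slow: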
row_iterator = duplicate_df_selfmerge.iterrows()
_, prev = next(row_iterator)  # take the first item from row_iterator
for index, row in row_iterator:
    # drop the row if it mirrors the previous row
    if (row['term_x'] == prev['term_y']) and (row['term_y'] == prev['term_x']) and (row['Keyword'] == prev['Keyword']):
        duplicate_df_selfmerge.drop(index, inplace=True)
    prev = row
Answer 0 (score: 1)
You can combine the two columns, sort each pair, and then drop duplicate rows based on the sorted pairs:
df['together'] = [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))]
df.drop_duplicates(subset=['together'])
Out[11]:
     term_x  Intersections     term_y          together
0    boxers              1     briefs     boxers,briefs
2    babies              6   costumes   babies,costumes
4    babies             12    clothes    babies,clothes
6    babies              1  clothings  babies,clothings
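For reference, here is a minimal self-contained sketch of the same idea, rebuilt from the sample data in the question. Dropping the helper column at the end and the name `deduped` are my own additions, and joining with a comma assumes that no term contains a comma itself.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "term_x": ["boxers", "briefs", "babies", "costumes",
               "babies", "clothes", "babies", "clothings"],
    "Intersections": [1, 1, 6, 6, 12, 12, 1, 1],
    "term_y": ["briefs", "boxers", "costumes", "babies",
               "clothes", "babies", "clothings", "babies"],
})

# Build an order-independent key: sort each (term_x, term_y) pair, then join it.
# Assumption: no term contains a comma, otherwise distinct pairs could collide.
df["together"] = [",".join(pair)
                  for pair in map(sorted, zip(df["term_x"], df["term_y"]))]

# Keep the first row for each key, then discard the helper column.
deduped = df.drop_duplicates(subset=["together"]).drop(columns="together")
print(deduped)
```

Note that `drop_duplicates` returns a new DataFrame unless `inplace=True` is passed, so the result has to be assigned to keep it.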
Edit: You said time is an important factor here. Below are some timings comparing my solution and Allen's on a DataFrame with 200,000 rows:
while df.shape[0] < 200000:
    # double the frame; reassign, because concat returns a new DataFrame
    df = pd.concat([df, df], ignore_index=True)
%timeit df.apply(lambda x: str(sorted([x.term_x,x.term_y])), axis=1)
1 loop, best of 3: 6.62 s per loop
%timeit [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))]
10 loops, best of 3: 121 ms per loop
As you can see, my approach takes over 98% less time (roughly 55 times faster). pandas.DataFrame.apply is slow in many cases.
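The `%timeit` lines above only work in IPython. Outside of it, a comparison along the same lines could be sketched with the standard `timeit` module. The helper names `with_apply` and `with_list_comprehension` and the smaller row count are my own choices to keep the run short, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "term_x": ["boxers", "briefs", "babies", "costumes",
               "babies", "clothes", "babies", "clothings"],
    "Intersections": [1, 1, 6, 6, 12, 12, 1, 1],
    "term_y": ["briefs", "boxers", "costumes", "babies",
               "clothes", "babies", "clothings", "babies"],
})

# Double the frame until it has at least 16,000 rows
# (fewer than the 200,000 used above, so the sketch finishes quickly).
big = df
while len(big) < 16000:
    big = pd.concat([big, big], ignore_index=True)


def with_apply():
    # Row-wise apply: calls the lambda once per row.
    return big.apply(lambda x: str(sorted([x.term_x, x.term_y])), axis=1)


def with_list_comprehension():
    # Plain Python loop over the zipped columns.
    return [",".join(x) for x in map(sorted, zip(big["term_x"], big["term_y"]))]


repeats = 3
t_apply = timeit.timeit(with_apply, number=repeats) / repeats
t_listcomp = timeit.timeit(with_list_comprehension, number=repeats) / repeats
print(f"apply:              {t_apply:.4f} s per call")
print(f"list comprehension: {t_listcomp:.4f} s per call")
```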
Answer 1 (score: 1)
import pandas as pd

df = pd.DataFrame({'Intersections': {0: 1, 1: 1, 2: 6, 3: 6, 4: 12, 5: 12, 6: 1, 7: 1},
                   'term_x': {0: 'boxers', 1: 'briefs', 2: 'babies', 3: 'costumes', 4: 'babies',
                              5: 'clothes', 6: 'babies', 7: 'clothings'},
                   'term_y': {0: 'briefs', 1: 'boxers', 2: 'costumes', 3: 'babies', 4: 'clothes',
                              5: 'babies', 6: 'clothings', 7: 'babies'}})
# create a column that combines term_x and term_y in sorted order
df['team_xy'] = df.apply(lambda x: str(sorted([x.term_x,x.term_y])), axis=1)
#drop duplicates on the combined fields.
df.drop_duplicates(subset='team_xy',inplace=True)
df
Out[916]:
   Intersections  term_x     term_y                  team_xy
0              1  boxers     briefs     ['boxers', 'briefs']
2              6  babies   costumes   ['babies', 'costumes']
4             12  babies    clothes    ['babies', 'clothes']
6              1  babies  clothings  ['babies', 'clothings']
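The loop in the question also compares a third column before dropping a row. If a mirrored pair should only count as a duplicate when another column matches too, that column can be added to `subset`. The sketch below uses `Intersections` as that extra column purely for illustration, and the data is hypothetical (the last row is altered so its value differs from its mirror).

```python
import pandas as pd

# Hypothetical data: rows 2 and 3 are a mirrored pair with different Intersections.
df = pd.DataFrame({
    "term_x": ["boxers", "briefs", "babies", "costumes"],
    "Intersections": [1, 1, 6, 7],
    "term_y": ["briefs", "boxers", "costumes", "babies"],
})

# Order-independent key for each pair, built without DataFrame.apply.
df["team_xy"] = [str(sorted(pair)) for pair in zip(df["term_x"], df["term_y"])]

# A row is dropped only if both the pair key and Intersections repeat.
deduped = df.drop_duplicates(subset=["team_xy", "Intersections"])
print(deduped)
```

Here row 1 is dropped as a mirror of row 0, while rows 2 and 3 both survive because their `Intersections` values differ.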