Removing duplicates from a Pandas DataFrame when the duplicate values are in different columns of the next row

Asked: 2017-05-03 13:01:11

Tags: python pandas

I have a large dataframe in the following format:

term_x     Intersections  term_y
boxers          1         briefs
briefs          1         boxers
babies          6         costumes
costumes        6         babies
babies         12         clothes
clothes        12         babies
babies          1         clothings
clothings       1         babies

This file has several million rows. What I want to do is cut out these redundant rows. Is there a way to use Pandas' duplicate-removal functionality to eliminate these duplicates in a fast and Pythonic way? My current approach iterates over the whole dataframe, comparing each row with the previous one and dropping the duplicated rows, but this has proven very slow:

row_iterator = duplicate_df_selfmerge.iterrows()
_, prev = next(row_iterator)  # take the first row from row_iterator
for index, row in row_iterator:
    # a row is redundant if it is the previous row with term_x/term_y swapped
    if (row['term_x'] == prev['term_y']) and (row['term_y'] == prev['term_x']) \
            and (row['Intersections'] == prev['Intersections']):
        duplicate_df_selfmerge.drop(index, inplace=True)
    prev = row

2 Answers:

Answer 0 (score: 1)

You can combine the two columns, sort each pair, and then drop duplicate rows on those sorted pairs:

df['together'] = [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))]

df.drop_duplicates(subset=['together'])
Out[11]: 
   term_x  Intersections     term_y          together
0  boxers              1     briefs     boxers,briefs
2  babies              6   costumes   babies,costumes
4  babies             12    clothes    babies,clothes
6  babies              1  clothings  babies,clothings
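For completeness, this approach runs end to end as below; the DataFrame literal is reconstructed from the question's sample data, and the helper column is discarded afterwards (a sketch, not part of the original answer):

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'term_x': ['boxers', 'briefs', 'babies', 'costumes',
               'babies', 'clothes', 'babies', 'clothings'],
    'Intersections': [1, 1, 6, 6, 12, 12, 1, 1],
    'term_y': ['briefs', 'boxers', 'costumes', 'babies',
               'clothes', 'babies', 'clothings', 'babies'],
})

# Build an order-independent key for each pair, then drop duplicates on it.
# Both ('boxers', 'briefs') and ('briefs', 'boxers') sort to the same key.
df['together'] = [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))]
deduped = df.drop_duplicates(subset=['together']).drop(columns='together')
print(deduped)
```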

Edit: You said that time is an important factor in this problem. Here are some timings comparing my solution and Allen's on a dataframe with 200,000 rows:

while df.shape[0] < 200000:
    df = df.append(df)  # append returns a new frame; reassign, or this loops forever

%timeit df.apply(lambda x: str(sorted([x.term_x,x.term_y])), axis=1)
1 loop, best of 3: 6.62 s per loop

%timeit [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))]
10 loops, best of 3: 121 ms per loop

As you can see, my approach is over 98% faster. pandas.DataFrame.apply is slow in many cases.

Answer 1 (score: 1)

df = pd.DataFrame({'Intersections': {0: 1, 1: 1, 2: 6, 3: 6, 4: 12, 5: 12, 6: 1, 7: 1},
 'term_x': {0: 'boxers',1: 'briefs',2: 'babies',3: 'costumes',4: 'babies',
  5: 'clothes',6: 'babies',7: 'clothings'}, 'term_y': {0: 'briefs',1: 'boxers',
  2: 'costumes',3: 'babies',4: 'clothes',5: 'babies',6: 'clothings',7: 'babies'}})

# create a column combining term_x and term_y in sorted order
df['team_xy'] = df.apply(lambda x: str(sorted([x.term_x, x.term_y])), axis=1)
# drop duplicates on the combined field
df.drop_duplicates(subset='team_xy', inplace=True)

df
Out[916]: 
   Intersections  term_x     term_y                  team_xy
0              1  boxers     briefs     ['boxers', 'briefs']
2              6  babies   costumes   ['babies', 'costumes']
4             12  babies    clothes    ['babies', 'clothes']
6              1  babies  clothings  ['babies', 'clothings']
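Beyond the two answers above, a fully vectorized variant (my sketch, not from either answer) row-sorts the two term columns with NumPy and drops duplicates on the result, avoiding per-row Python string building entirely:

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'term_x': ['boxers', 'briefs', 'babies', 'costumes',
               'babies', 'clothes', 'babies', 'clothings'],
    'Intersections': [1, 1, 6, 6, 12, 12, 1, 1],
    'term_y': ['briefs', 'boxers', 'costumes', 'babies',
               'clothes', 'babies', 'clothings', 'babies'],
})

# np.sort with axis=1 sorts each (term_x, term_y) pair alphabetically,
# so both orderings of a pair map to the same sorted row.
pairs = pd.DataFrame(np.sort(df[['term_x', 'term_y']].to_numpy(), axis=1),
                     index=df.index)
deduped = df[~pairs.duplicated()]
print(deduped)
```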