Question

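I have a pandas DataFrame with multiple columns and rows. In one specific column I want to find consecutive repeated values and delete the entire row where the repeated value first occurs.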

I found a possible solution (linked in the original post), but it only applies to a pandas Series: a.loc[a.shift() != a]
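As a rough illustration of why the Series expression is not quite what is needed, here is a minimal sketch that applies it to the column0 values from the example. The Series name `a` comes from the expression itself; the values and labels are copied from the table in this question.

```python
import pandas as pd

# column0 values from the example table, labeled with the same row names
a = pd.Series([0.5, 0.5, 1.0, 1.5], index=["row0", "row1", "row2", "row3"])

# a.shift() moves values down by one, so each row is compared with the previous one.
# Rows that differ from their predecessor are kept, i.e. the FIRST row of each run.
kept = a.loc[a.shift() != a]
print(kept)
```

On this data the expression keeps row0 and drops row1, which is the opposite of the result wanted here.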

My DataFrame looks like this:

Index column0 column1 column2 column3
row0 0.5 25 26 27
row1 0.5 30 31 32
row2 1.0 35 36 37
row3 1.5 40 41 42

Index column0 column1 column2 column3
row1 0.5 30 31 32
row2 1.0 35 36 37
row3 1.5 40 41 42

The second table is the expected result, with row0 deleted.

P.S. These repeats do not occur at the start of my data; they appear at random positions in column0.

Answer 1

df.loc[df.iloc[:, 0].shift(-1) != df.iloc[:, 0]]

This is the answer! Thanks, Quang Hoang!
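A minimal sketch of this one-liner applied to the example data from the question. The DataFrame is rebuilt by hand from the table shown there; the intermediate names `col`, `keep`, and `result` are only illustrative.

```python
import pandas as pd

# Rebuild the example table from the question
df = pd.DataFrame(
    {
        "column0": [0.5, 0.5, 1.0, 1.5],
        "column1": [25, 30, 35, 40],
        "column2": [26, 31, 36, 41],
        "column3": [27, 32, 37, 42],
    },
    index=["row0", "row1", "row2", "row3"],
)

col = df.iloc[:, 0]            # the first column (column0)
keep = col.shift(-1) != col    # True where the NEXT row holds a different value
result = df.loc[keep]          # rows followed by a repeat are dropped
print(result)
```

Here row0 is removed because row1 repeats its column0 value. Note that a row is kept only when the next value differs, so in a run of three or more equal values every row except the last one is dropped.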

Answer 2

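Here is a step-by-step solution. Note that it drops the first occurrence of every value that appears more than once in column A, whether or not the repeats are consecutive.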

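import pandas as pd
import numpy as np

# Example data: 10 rows of random integers in [0, 7)
df = pd.DataFrame(np.random.randint(0, 7, size=(10, 4)), columns=list('ABCD'))

# Count how many times each value occurs in column A
number_of_occurrence_on_first_column = df.groupby('A')['A'].count()

# Values of A that occur more than once
has_duplicates_items = number_of_occurrence_on_first_column[number_of_occurrence_on_first_column > 1].index

# All rows whose A value is one of those repeated values
all_duplicate_items = df[df.A.isin(has_duplicates_items)]

# Index labels of the first occurrence of each repeated value
need_to_delete = pd.DataFrame(all_duplicate_items['A']).drop_duplicates().index
# Drop those first occurrences
df = df.drop(need_to_delete)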
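The code above counts repeats anywhere in the column. If only consecutive repeats should count, as in the question, the same step-by-step style could be sketched with shift instead. This is one possible variant, not the answerer's method; the function name `drop_first_of_each_run` and the sample data are made up for illustration.

```python
import pandas as pd

def drop_first_of_each_run(df, column):
    """Drop the first row of every run of consecutive equal values in `column`.

    Hypothetical helper for illustration only.
    """
    col = df[column]
    # True where the value differs from the previous row, i.e. a run starts here
    starts_run = col != col.shift()
    # True where the next row repeats the current value
    repeated_next = col == col.shift(-1)
    # Remove rows that start a run AND are immediately repeated
    return df.loc[~(starts_run & repeated_next)]

# Made-up sample: a run of three 1s, a run of two 3s, and a non-adjacent 1
sample = pd.DataFrame({"A": [1, 1, 1, 2, 3, 3, 1], "B": list("abcdefg")})
cleaned = drop_first_of_each_run(sample, "A")
print(cleaned)
```

Unlike the one-liner in Answer 1, this variant removes only the first row of a run, so a run of three equal values keeps its last two rows. The trailing 1 in the sample is kept because it is not adjacent to the earlier 1s. NaN values are never treated as repeats here, since NaN does not compare equal to itself.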

How do I find the first of consecutive repeated values in a pandas DataFrame column and delete that row?
