What is the most efficient way to drop only consecutive duplicates in pandas?
drop_duplicates gives this:
In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])
In [4]: a.drop_duplicates()
Out[4]:
1 1
2 2
4 3
dtype: int64
But I want this:
In [4]: a.something()
Out[4]:
1 1
2 2
4 3
5 2
dtype: int64
Answer 0 (score: 62)
Use shift:
a.loc[a.shift(-1) != a]
Out[3]:
1 1
3 2
4 3
5 2
dtype: int64
So the above uses boolean criteria: we compare the data against itself shifted by -1 rows to create the mask.
Another method is to use diff:
In [82]:
a.loc[a.diff() != 0]
Out[82]:
1 1
2 2
4 3
5 2
dtype: int64
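But this is slower than the original method if you have a large number of rows.
Update
Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift() since the default is a period of 1. This returns the first of each run of consecutive values: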
In [87]:
a.loc[a.shift() != a]
Out[87]:
1 1
2 2
4 3
5 2
dtype: int64
Note the difference in the index values, thanks @BjarkeEbert!
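To make the difference between the two shift directions concrete, here is a minimal sketch on the Series from the question. The names keep_first and keep_last are only illustrative and do not come from the answer.

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# shift() moves the values down one row, so each element is
# compared with its predecessor: the first of each run survives.
keep_first = a.shift() != a

# shift(-1) moves the values up one row, so each element is
# compared with its successor: the last of each run survives.
keep_last = a.shift(-1) != a

print(a.loc[keep_first].index.tolist())  # [1, 2, 4, 5]
print(a.loc[keep_last].index.tolist())   # [1, 3, 4, 5]
```

The values are the same either way; only the index labels that are kept differ.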
Answer 1 (score: 7)
Here is an update that makes it work with multiple columns. Use ".any(axis=1)" to combine the results from each column:
cols = ["col1","col2","col3"]
de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]
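As a rough illustration of how the per-column comparison is combined, here is a small sketch. The DataFrame contents are made up for the example; only the cols list and the masking expression follow the snippet above.

```python
import pandas as pd

# Hypothetical data: row 1 repeats row 0 in every column.
df = pd.DataFrame({
    "col1": [1, 1, 1, 2],
    "col2": [5, 5, 6, 6],
    "col3": [0, 0, 0, 0],
})
cols = ["col1", "col2", "col3"]

# One boolean per cell: True where the value differs from the row above.
changed = df[cols].shift() != df[cols]

# A row is kept if at least one of its columns changed.
mask = changed.any(axis=1)
de_dup = df[cols].loc[mask]

print(de_dup.index.tolist())  # [0, 2, 3]
```

Row 1 is dropped because all three columns match row 0; rows 2 and 3 each change in one column and are kept.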
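Answer 2 (score: 4)
Since we are going for the most efficient way (i.e. performance), let's use the array data to leverage NumPy. We will take one-off slices and compare them, similar to the shift method discussed earlier in @EdChum's post. But with NumPy slicing we end up with an array that is one element shorter, so we need to concatenate a True element at the start to select the first element. Hence, we would have an implementation like so -
import numpy as np

def drop_consecutive_duplicates(a):
    ar = a.values
    return a[np.concatenate(([True], ar[:-1] != ar[1:]))]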
Sample run -
In [149]: a
Out[149]:
1 1
2 2
3 2
4 3
5 2
dtype: int64
In [150]: drop_consecutive_duplicates(a)
Out[150]:
1 1
2 2
4 3
5 2
dtype: int64
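For readers who want to see the intermediate arrays, here is a step-by-step sketch of the same slicing idea on the sample Series. The variable names differs_from_prev and mask are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])
ar = a.values

# Compare each element with its predecessor. Both slices have
# len(ar) - 1 elements, so the result is one element shorter.
differs_from_prev = ar[1:] != ar[:-1]

# The first element has no predecessor, so it is always kept.
mask = np.concatenate(([True], differs_from_prev))

print(differs_from_prev.tolist())  # [True, False, True, True]
print(mask.tolist())               # [True, True, False, True, True]
print(a[mask].index.tolist())      # [1, 2, 4, 5]
```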
Timings on large arrays, compared with @EdChum's solution -
In [142]: a = pd.Series(np.random.randint(1,5,(1000000)))
In [143]: %timeit a.loc[a.shift() != a]
100 loops, best of 3: 12.1 ms per loop
In [144]: %timeit drop_consecutive_duplicates(a)
100 loops, best of 3: 11 ms per loop
In [145]: a = pd.Series(np.random.randint(1,5,(10000000)))
In [146]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 136 ms per loop
In [147]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 114 ms per loop
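So, there's some improvement!
Get a major boost for values only!
If only the values are needed, we can get a major boost by simply indexing into the array data, like so -
def drop_consecutive_duplicates(a):
    ar = a.values
    return ar[np.concatenate(([True], ar[:-1] != ar[1:]))]
Sample run -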
In [170]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])
In [171]: drop_consecutive_duplicates(a)
Out[171]: array([1, 2, 3, 2])
Timings -
In [173]: a = pd.Series(np.random.randint(1,5,(10000000)))
In [174]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 137 ms per loop
In [175]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 61.3 ms per loop
Answer 3 (score: 2)
Here is a function that handles both pd.Series and pd.DataFrames. You can mask or drop, choose the axis, and finally choose to drop with 'any' or 'all' 'NaN'. It is not optimized in terms of computation time, but it has the advantage of being robust and clear.
import numpy as np
import pandas as pd

# To mask/drop successive values in pandas
def Mask_Or_Drop_Successive_Identical_Values(df, drop=False,
                                             keep_first=True,
                                             axis=0, how='all'):
    '''
    #Function built with the help of:
    # 1) https://stackoverflow.com/questions/48428173/how-to-change-consecutive-repeating-values-in-pandas-dataframe-series-to-nan-or
    # 2) https://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates
    Input:
    df should be a pandas.DataFrame or a pandas.Series
    Output:
    df or ts with masked or dropped values
    '''
    # Mask keeping the first occurrence
    if keep_first:
        df = df.mask(df.shift(1) == df)
    # Mask including the first occurrence
    else:
        df = df.mask((df.shift(1) == df) | (df.shift(-1) == df))
    # Drop the values (e.g. rows are deleted)
    if drop:
        return df.dropna(axis=axis, how=how)
    # Only mask the values (e.g. become 'NaN')
    else:
        return df
Here is the test code to include in the script:
if __name__ == "__main__":
    # With time series
    print("With time series:\n")
    ts = pd.Series([1,1,2,2,3,2,6,6,float('nan'), 6,6,float('nan'),float('nan')],
                   index=[0,1,2,3,4,5,6,7,8,9,10,11,12])
    print("#Original ts:")
    print(ts)
    print("\n## 1) Mask keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False,
                                                   keep_first=True))
    print("\n## 2) Mask including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False,
                                                   keep_first=False))
    print("\n## 3) Drop keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True,
                                                   keep_first=True))
    print("\n## 4) Drop including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True,
                                                   keep_first=False))
    # With dataframes
    print("With dataframe:\n")
    df = pd.DataFrame(np.random.randn(15, 3))
    df.iloc[4:9,0]=40
    df.iloc[8:15,1]=22
    df.iloc[8:12,2]=0.23
    print("#Original df:")
    print(df)
    print("\n## 5) Mask keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False,
                                                   keep_first=True))
    print("\n## 6) Mask including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False,
                                                   keep_first=False))
    print("\n## 7) Drop 'any' keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=True,
                                                   how='any'))
    print("\n## 8) Drop 'all' keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=True,
                                                   how='all'))
    print("\n## 9) Drop 'any' including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=False,
                                                   how='any'))
    print("\n## 10) Drop 'all' including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=False,
                                                   how='all'))
Here is the expected result:
With time series:
#Original ts:
0 1.0
1 1.0
2 2.0
3 2.0
4 3.0
5 2.0
6 6.0
7 6.0
8 NaN
9 6.0
10 6.0
11 NaN
12 NaN
dtype: float64
## 1) Mask keeping the first occurence:
0 1.0
1 NaN
2 2.0
3 NaN
4 3.0
5 2.0
6 6.0
7 NaN
8 NaN
9 6.0
10 NaN
11 NaN
12 NaN
dtype: float64
## 2) Mask including the first occurence:
0 NaN
1 NaN
2 NaN
3 NaN
4 3.0
5 2.0
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
dtype: float64
## 3) Drop keeping the first occurence:
0 1.0
2 2.0
4 3.0
5 2.0
6 6.0
9 6.0
dtype: float64
## 4) Drop including the first occurence:
4 3.0
5 2.0
dtype: float64
With dataframe:
#Original df:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 40.000000 -0.470958 -0.339213
6 40.000000 1.613524 0.271641
7 40.000000 -1.810958 -1.568372
8 40.000000 22.000000 0.230000
9 -0.296557 22.000000 0.230000
10 -0.921238 22.000000 0.230000
11 -0.170195 22.000000 0.230000
12 1.460457 22.000000 -0.295418
13 0.307825 22.000000 -0.759131
14 0.287392 22.000000 0.378315
## 5) Mask keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN 22.000000 0.230000
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 6) Mask including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 NaN 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN NaN NaN
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 7) Drop 'any' keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
## 8) Drop 'all' keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN 22.000000 0.230000
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 9) Drop 'any' including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
## 10) Drop 'all' including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 NaN 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
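Answer 4 (score: 0)
For other Stack explorers, this builds on johnml1135's answer above. It removes the next duplicate based on multiple columns, but keeps all columns in the result. Once the dataframe is sorted, it keeps the first row and drops the second row if the "cols" match, even if other columns hold non-matching information.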
cols = ["col1","col2","col3"]
df = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]
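A small sketch of that behaviour follows. The data and the extra column name "other" are invented for the illustration; the filtering expression is the one from the snippet above.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 agree on cols but differ in "other".
df = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": [5, 5, 5],
    "col3": [7, 7, 7],
    "other": ["a", "b", "c"],
})
cols = ["col1", "col2", "col3"]

# Only the columns in cols decide whether a row is a repeat;
# the row selection is then applied to the full frame.
result = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]

print(result.index.tolist())    # [0, 2]
print(result.columns.tolist())  # ['col1', 'col2', 'col3', 'other']
```

Row 1 is dropped even though its "other" value differs, because that column is not part of cols.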
Answer 5 (score: 0)
Another implementation:
a.loc[a.ne(a.shift())]
The method pandas.Series.ne is the not-equal operator, so a.ne(a.shift()) is equivalent to a != a.shift(). See the pandas.Series.ne documentation.
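A quick check of that equivalence on the Series from the question might look like this. The names mask_method and mask_operator are only for the example.

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

mask_method = a.ne(a.shift())    # method form
mask_operator = a != a.shift()   # operator form

print(mask_method.equals(mask_operator))  # True
print(a.loc[mask_method].tolist())        # [1, 2, 3, 2]
```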
Answer 6 (score: 0)
Here is a variant of EdChum's answer that also treats consecutive NaNs as duplicates:
def remove_consecutive_duplicates_and_nans(s):
    # By default, `shift` uses NaN as a fill value, which breaks our
    # removal of consecutive NaNs. Hence we use a different sentinel
    # object instead.
    shifted = s.astype(object).shift(-1, fill_value=object())
    return s.loc[
        (shifted != s)
        & ~(shifted.isna() & s.isna())
    ]
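Because the function above compares against shift(-1), it appears to keep the last element of each run. If the first element of each run is wanted instead, one possible alternative is sketched below. It is a different technique from the answer (no sentinel object), and the function name drop_consecutive_duplicates_nan_aware is hypothetical.

```python
import numpy as np
import pandas as pd

def drop_consecutive_duplicates_nan_aware(s):
    # Nothing to compare in an empty Series.
    if len(s) == 0:
        return s
    prev = s.shift()
    # A row repeats its predecessor if the values are equal,
    # or if both are missing (NaN never compares equal to NaN).
    same_as_prev = (s == prev) | (s.isna() & prev.isna())
    # The first row has no predecessor; shift() filled it with NaN,
    # so make sure a leading NaN is not mistaken for a repeat.
    same_as_prev.iloc[0] = False
    return s.loc[~same_as_prev]

s = pd.Series([np.nan, np.nan, 1.0, 1.0, np.nan, np.nan, 2.0])
result = drop_consecutive_duplicates_nan_aware(s)
print(result.index.tolist())  # [0, 2, 4, 6]
```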
Answer 7 (score: 0)
Create a new column.
df['match'] = df.col1.eq(df.col1.shift())
Then:
df = df[df['match']==False]
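If the helper column is not wanted in the result, the same filter can be applied with a temporary boolean Series. This is a sketch with made-up data; the name is_repeat and the column "val" are not from the answer.

```python
import pandas as pd

# Hypothetical data with runs of repeated values in col1.
df = pd.DataFrame({"col1": [1, 1, 2, 2, 1], "val": list("abcde")})

# True where col1 equals the value in the row above.
is_repeat = df["col1"].eq(df["col1"].shift())

# ~ negates the mask, which avoids comparing against False
# and leaves no extra 'match' column behind.
result = df[~is_repeat]

print(result.index.tolist())  # [0, 2, 4]
```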