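Remove duplicate values that occur more than N times

Date: 2018-01-16 07:12:10

Tags: python pandas dataframe duplicates

My DataFrame has duplicate values in the "lid" column. I want to use pandas to drop the rows whose "lid" value occurs more than 2 times. This is the original table: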

entity  pnb head#   state   lid
ABB001  A03 3   DOWN    A
ABB001  A03 3   DOWN    A
ABB001  A03 3   DOWN    A
ABB002  A02 4   DOWN    B
ABB002  A02 4   DOWN    B
ABB002  A02 2   DOWN    C
ABB002  A02 4   DOWN    D
ABB002  A02 4   DOWN    E
ABB002  A02 4   DOWN    E
ABB002  A02 4   DOWN    E
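
To make the answers below reproducible, the table can be rebuilt as a DataFrame. This is only a sketch: the column names come from the header row above, and the dtypes (integers for head#, strings elsewhere) are an assumption, since the question does not state them.

```python
import pandas as pd

# Rebuild the question's table; dtypes are assumed, not given in the question.
df = pd.DataFrame({
    "entity": ["ABB001"] * 3 + ["ABB002"] * 7,
    "pnb": ["A03"] * 3 + ["A02"] * 7,
    "head#": [3, 3, 3, 4, 4, 2, 4, 4, 4, 4],
    "state": ["DOWN"] * 10,
    "lid": list("AAABBCDEEE"),
})

# Occurrences of each lid value; A and E appear 3 times and should be dropped.
counts = df["lid"].value_counts()
print(counts)
```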

The desired result:

entity  pnb head#   state   lid
ABB002  A02 4   DOWN    B
ABB002  A02 4   DOWN    B
ABB002  A02 2   DOWN    C
ABB002  A02 4   DOWN    D

3 answers:

Answer 0 (score: 3)

Use groupby + transform:

df[~df.lid.groupby(df.lid).transform('count').gt(2)]

   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

transform gives you a Series of counts that is the same size as the original column.

v = df.lid.groupby(df.lid).transform('count')
v

0    3
1    3
2    3
3    2
4    2
5    1
6    1
7    3
8    3
9    3
Name: lid, dtype: int64

Use it to build a boolean mask that is False for the rows that need to go:

~v.gt(2)

0    False
1    False
2    False
3     True
4     True
5     True
6     True
7    False
8    False
9    False
Name: lid, dtype: bool

Then index df with that mask.
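
Since the title asks about an arbitrary N, the same idea can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the function name drop_over_n and its parameters are hypothetical, and the sample frame only reproduces the lid and head# columns.

```python
import pandas as pd

def drop_over_n(frame, column, n):
    """Keep only the rows whose value in `column` occurs at most `n` times."""
    # Per-row occurrence count of that row's value, aligned with frame's index.
    counts = frame[column].groupby(frame[column]).transform("count")
    return frame[counts.le(n)]

sample = pd.DataFrame({
    "lid": list("AAABBCDEEE"),
    "head#": [3, 3, 3, 4, 4, 2, 4, 4, 4, 4],
})
kept = drop_over_n(sample, "lid", 2)
print(kept)
```

With n=2 this keeps the B, C and D rows (index 3 to 6), matching the output above; with n=3 nothing is dropped.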

Answer 1 (score: 3)

Option 0
Use value_counts and isin

df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2].index)]

   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D
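
The one-liner in Option 0 is easier to follow when split into steps. The sketch below uses only the lid column from the question; the variable names are illustrative.

```python
import pandas as pd

lid = pd.Series(list("AAABBCDEEE"))

counts = lid.value_counts()              # occurrences per distinct value
too_common = counts[counts > 2].index    # values that occur more than twice
mask = ~lid.isin(too_common)             # True for the rows to keep

print(sorted(too_common))   # ['A', 'E']
print(lid[mask].tolist())   # ['B', 'B', 'C', 'D']
```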

Option 1
Use np.in1d and pd.factorize

A better implementation:
lids = df.lid.values
f, u = pd.factorize(df.lid.values)
df[np.in1d(lids, u[np.bincount(f) <= 2])]

   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

Option 2
Use np.bincount and pd.factorize

f, u = pd.factorize(df.lid)
df[np.bincount(f)[f] <= 2]

   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D
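
The expression np.bincount(f)[f] in Option 2 is compact, so here is a step-by-step sketch of what each piece holds for the question's lid column. The intermediate names are illustrative only.

```python
import numpy as np
import pandas as pd

lid = pd.Series(list("AAABBCDEEE"))

codes, uniques = pd.factorize(lid)  # integer code per row, in order of first appearance
per_code = np.bincount(codes)       # how many times each code occurs
per_row = per_code[codes]           # look the count up again for every row

print(codes)     # [0 0 0 1 1 2 3 4 4 4]
print(per_code)  # [3 2 1 1 3]
print(per_row)   # [3 3 3 2 2 1 1 3 3 3]
```

Comparing per_row <= 2 then yields the same keep-mask as the transform-based answers.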

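A fun demonstration to highlight what @cᴏʟᴅsᴘᴇᴇᴅ and I were discussing in the comments.

Love the bincount. There should be an np.unique in there somewhere too. - cᴏʟᴅsᴘᴇᴇᴅ

Yes. However, I don't use np.unique, because @Jeff told me that np.unique sorts when you ask for the counts, the indices, or the inverse. pd.factorize does not sort, and it is O(n). I have verified this. - piRSquared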
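
The ordering difference mentioned in that comment can be seen on a tiny input. This is a sketch with made-up values; it shows only the ordering of the results, not the internal algorithms or their cost.

```python
import numpy as np
import pandas as pd

values = np.array(["b", "a", "b", "c"], dtype=object)

# pd.factorize numbers the values in order of first appearance.
f_codes, f_uniques = pd.factorize(values)
# np.unique returns the distinct values sorted.
u_uniques, u_codes = np.unique(values, return_inverse=True)

print(list(f_uniques), f_codes)  # ['b', 'a', 'c'] [0 1 0 2]
print(list(u_uniques), u_codes)  # ['a', 'b', 'c'] [1 0 1 2]

# The per-row counts agree either way, so both give the same row filter.
same = np.array_equal(np.bincount(f_codes)[f_codes], np.bincount(u_codes)[u_codes])
print(same)  # True
```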

Timing test

from timeit import timeit

import numpy as np
import pandas as pd

def bincount_factorize(df):
    f, u = pd.factorize(df.lid.values)
    return df[np.bincount(f)[f] <= 2]

def bincount_unique(df):
    u, f = np.unique(df.lid.values, return_inverse=True)
    return df[np.bincount(f)[f] <= 2]

def in1d_factorize(df):
    lids = df.lid.values
    f, u = pd.factorize(df.lid.values)
    return df[np.in1d(lids, u[np.bincount(f) <= 2])]

def transform(df):
    return df[df.groupby('lid')['lid'].transform('size') <= 2]

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000,
           30000, 100000, 300000, 1000000],
    columns=['bincount_factorize', 'bincount_unique',
             'in1d_factorize', 'transform'],
    dtype=float
)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import d, {j}'
        res.at[i, j] = timeit(stmt, setp, number=100)

res.div(res.min(1), 0)

         bincount_factorize  bincount_unique  in1d_factorize  transform
10                 1.421827         1.000000        1.119577   3.751167
30                 1.008412         1.037297        1.000000   3.072631
100                1.000000         1.531300        1.028267   3.304560
300                1.000000         2.666583        1.182812   3.637235
1000               1.065213         5.563098        1.000000   2.556469
3000               1.024658        10.480027        1.000000   2.238765
10000              1.073403        14.716801        1.000000   1.574780
30000              1.000000        16.387130        1.053180   1.494161
100000             1.000000        18.533078        1.003031   1.369867
300000             1.078129        20.183122        1.000000   1.530698
1000000            1.166800        24.571463        1.000000   1.670423

res.plot(loglog=True)

[Image: log-log plot produced by res.plot(loglog=True), one line per method]

Answer 2 (score: 2)

Use transform with boolean indexing:

df = df[df.groupby('lid')['lid'].transform('size') <= 2]

print (df)
   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

Details:

print (df.groupby('lid')['lid'].transform('size'))
0    3
1    3
2    3
3    2
4    2
5    1
6    1
7    3
8    3
9    3
Name: lid, dtype: int64

print (df.groupby('lid')['lid'].transform('size') <= 2)
0    False
1    False
2    False
3     True
4     True
5     True
6     True
7    False
8    False
9    False
Name: lid, dtype: bool

Another, slower solution using filter:

df = df.groupby('lid').filter(lambda x: len(x) <= 2)
print (df)
   entity  pnb  head# state lid
3  ABB002  A02      4  DOWN   B
4  ABB002  A02      4  DOWN   B
5  ABB002  A02      2  DOWN   C
6  ABB002  A02      4  DOWN   D

Timings

#jez1
In [34]: %timeit (df[df.groupby('lid')['lid'].transform('size') <= 2000])
10 loops, best of 3: 57.8 ms per loop

#jez2
In [35]: %timeit df.groupby('lid').filter(lambda x: len(x) <= 2000)
10 loops, best of 3: 124 ms per loop

#cᴏʟᴅsᴘᴇᴇᴅ
In [36]: %timeit (df[~df.lid.groupby(df.lid).transform('count').gt(2000)])
10 loops, best of 3: 93.6 ms per loop

#pir1
In [37]: %timeit (df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2000].index)])
10 loops, best of 3: 137 ms per loop

#pir2
In [38]: %timeit (pir(df))
10 loops, best of 3: 32.9 ms per loop

Setup

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list('abcde') 
df = pd.DataFrame({'lid': np.random.choice(L, N, p=(0.75,0.0001,0.0005,0.0005,0.2489)),
                   'A':np.random.randint(10000,size=N)})
df = df.sort_values(['A','lid']).reset_index(drop=True)
#print (df)


print (df[~df.lid.groupby(df.lid).transform('count').gt(2000)])
print (df[df.groupby('lid')['lid'].transform('size') <= 2000])
print (df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2000].index)])


def pir(df):
    f, u = pd.factorize(df.lid)
    return df[np.bincount(f)[f] <= 2000]

print (pir(df))

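Caveat

These results do not account for how performance varies with the number of groups, which strongly affects the timings of some of these solutions.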
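
One way to explore that caveat is to time the same two solutions on frames that differ only in the number of distinct lid values. The sketch below is an assumption-laden illustration rather than the answer's own benchmark: the helper names, the row count, the group counts, and the threshold rule (rows // n_groups) are all hypothetical, and the sizes are kept small so it runs quickly.

```python
from timeit import timeit

import numpy as np
import pandas as pd

def by_transform(frame, limit):
    return frame[frame.groupby("lid")["lid"].transform("size") <= limit]

def by_filter(frame, limit):
    return frame.groupby("lid").filter(lambda g: len(g) <= limit)

rng = np.random.default_rng(0)
rows = 20_000  # hypothetical size, chosen only to keep the run short

timings, agree = {}, {}
for n_groups in (5, 5_000):  # few large groups vs. many small groups
    frame = pd.DataFrame({"lid": rng.integers(0, n_groups, size=rows)})
    limit = rows // n_groups  # hypothetical threshold near the mean group size
    # Sanity check: both solutions should keep exactly the same rows.
    agree[n_groups] = by_transform(frame, limit).index.equals(
        by_filter(frame, limit).index
    )
    timings[n_groups] = {
        fn.__name__: timeit(lambda: fn(frame, limit), number=3)
        for fn in (by_transform, by_filter)
    }

# Rows: number of groups; columns: seconds for 3 runs of each solution.
print(pd.DataFrame(timings).T)
```

The absolute numbers depend on the machine and the pandas version, so only the trend across the two rows is meaningful.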