My dataframe has duplicate values in the "lid" column. I want to use Pandas to drop the rows whose "lid" value occurs more than 2 times. Here is the original table:
entity   pnb  head#  state  lid
ABB001   A03      3  DOWN     A
ABB001   A03      3  DOWN     A
ABB001   A03      3  DOWN     A
ABB002   A02      4  DOWN     B
ABB002   A02      4  DOWN     B
ABB002   A02      2  DOWN     C
ABB002   A02      4  DOWN     D
ABB002   A02      4  DOWN     E
ABB002   A02      4  DOWN     E
ABB002   A02      4  DOWN     E
The desired result is:
entity   pnb  head#  state  lid
ABB002   A02      4  DOWN     B
ABB002   A02      4  DOWN     B
ABB002   A02      2  DOWN     C
ABB002   A02      4  DOWN     D
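For readers who want to run the answers below, here is a minimal sketch that rebuilds the sample frame (my own reconstruction from the table above; pandas is assumed to be imported as pd):

import pandas as pd

df = pd.DataFrame({
    'entity': ['ABB001'] * 3 + ['ABB002'] * 7,
    'pnb':    ['A03'] * 3 + ['A02'] * 7,
    'head#':  [3, 3, 3, 4, 4, 2, 4, 4, 4, 4],
    'state':  ['DOWN'] * 10,
    'lid':    list('AAABBCDEEE'),
})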
Answer 0 (score: 3)
Use groupby + transform.
df[~df.lid.groupby(df.lid).transform('count').gt(2)]
entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D
transform gives you a Series of counts, the same length as the frame.
v = df.lid.groupby(df.lid).transform('count')
v
0 3
1 3
2 3
3 2
4 2
5 1
6 1
7 3
8 3
9 3
Name: lid, dtype: int64
Negating it gives a mask of the rows to keep:
~v.gt(2)
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 False
8 False
9 False
Name: lid, dtype: bool
Index df with this mask.
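As a small follow-up of my own, not part of the original answer: the same mask can also be applied with .loc, and the index reset if the original row labels (3, 4, 5, 6) are not wanted:

v = df.lid.groupby(df.lid).transform('count')
out = df.loc[~v.gt(2)]              # keeps the original index labels 3, 4, 5, 6
out = out.reset_index(drop=True)    # optional: renumber the surviving rows 0..3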
Answer 1 (score: 3)
Option 0
Use value_counts and isin:
df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2].index)]
entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D
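A step-by-step decomposition of the one-liner above (my own annotation, assuming the df built from the question's table):

counts = df.lid.value_counts()            # occurrences of each lid value
too_common = counts[counts > 2].index     # lid values appearing more than twice ('A' and 'E')
keep = ~df.lid.isin(too_common)           # True where the row's lid is not over-represented
result = df[keep]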
Option 1
Use np.in1d and pd.factorize:
lids = df.lid.values
f, u = pd.factorize(df.lid.values)
df[np.in1d(lids, u[np.bincount(f) <= 2])]
entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D
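The same option, unpacked into its intermediate pieces (my own annotation):

import numpy as np

lids = df.lid.values                  # lid column as a numpy array
f, u = pd.factorize(lids)             # f: integer code per row, u: unique lids in order of appearance
counts = np.bincount(f)               # how many rows carry each unique lid
rare = u[counts <= 2]                 # unique lids occurring at most twice ('B', 'C', 'D')
result = df[np.in1d(lids, rare)]      # keep rows whose lid is in that set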
Option 2
Use np.bincount and pd.factorize:
f, u = pd.factorize(df.lid)
df[np.bincount(f)[f] <= 2]
entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D
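The key trick in option 2 is the double indexing np.bincount(f)[f], which broadcasts each group's size back onto every row. A short sketch of my own showing the intermediate arrays:

f, u = pd.factorize(df.lid)       # f[i] is the integer code of row i's lid
group_sizes = np.bincount(f)      # group_sizes[k] = number of rows whose code is k
row_counts = group_sizes[f]       # per-row count of that row's lid, same length as df
result = df[row_counts <= 2]      # identical selection to the one-liner above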
A fun demo to highlight what @cᴏʟᴅsᴘᴇᴇᴅ and I were talking about in the comments.
Love the bincount. There should be an np.unique version somewhere too. – cᴏʟᴅsᴘᴇᴇᴅ
Yes. However, I don't use np.unique because @Jeff told me that np.unique sorts when you ask for counts, indices, or the inverse. pd.factorize does not, and it is O(n). I have verified that information. – piRSquared
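A quick sketch of mine illustrating that comment: np.unique returns the uniques sorted, whereas pd.factorize keeps them in order of first appearance (toy data, not from the answer):

import numpy as np
import pandas as pd

s = pd.Series(list('EEBBA'))

u_sorted, inv = np.unique(s.values, return_inverse=True)
print(u_sorted)               # ['A' 'B' 'E']  -- sorted

codes, u_first_seen = pd.factorize(s)
print(u_first_seen)           # Index(['E', 'B', 'A'], dtype='object') -- order of first appearance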
Timing tests
from timeit import timeit

import numpy as np
import pandas as pd

def bincount_factorize(df):
    f, u = pd.factorize(df.lid.values)
    return df[np.bincount(f)[f] <= 2]

def bincount_unique(df):
    u, f = np.unique(df.lid.values, return_inverse=True)
    return df[np.bincount(f)[f] <= 2]

def in1d_factorize(df):
    lids = df.lid.values
    f, u = pd.factorize(df.lid.values)
    return df[np.in1d(lids, u[np.bincount(f) <= 2])]

def transform(df):
    return df[df.groupby('lid')['lid'].transform('size') <= 2]

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000,
           30000, 100000, 300000, 1000000],
    columns=['bincount_factorize', 'bincount_unique',
             'in1d_factorize', 'transform'],
    dtype=float
)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)   # df is the sample frame from the question
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import d, {j}'
        res.at[i, j] = timeit(stmt, setp, number=100)

res.div(res.min(1), 0)
res.div(res.min(1), 0)
         bincount_factorize  bincount_unique  in1d_factorize  transform
10                 1.421827         1.000000        1.119577   3.751167
30                 1.008412         1.037297        1.000000   3.072631
100                1.000000         1.531300        1.028267   3.304560
300                1.000000         2.666583        1.182812   3.637235
1000               1.065213         5.563098        1.000000   2.556469
3000               1.024658        10.480027        1.000000   2.238765
10000              1.073403        14.716801        1.000000   1.574780
30000              1.000000        16.387130        1.053180   1.494161
100000             1.000000        18.533078        1.003031   1.369867
300000             1.078129        20.183122        1.000000   1.530698
1000000            1.166800        24.571463        1.000000   1.670423
res.plot(loglog=True)
Answer 2 (score: 2)
df = df[df.groupby('lid')['lid'].transform('size') <= 2]
print (df)
entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D
Details:
print (df.groupby('lid')['lid'].transform('size'))
0 3
1 3
2 3
3 2
4 2
5 1
6 1
7 3
8 3
9 3
Name: lid, dtype: int64
print (df.groupby('lid')['lid'].transform('size') <= 2)
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 False
8 False
9 False
Name: lid, dtype: bool
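A side note of mine, not from the answer: 'size' counts all rows in a group while 'count' counts only non-NaN values. Here they coincide because the counted column is lid itself, which has no missing values; the toy frame below (hypothetical data) shows where they differ:

tmp = pd.DataFrame({'lid': ['A', 'A', 'B'], 'val': [1.0, None, 2.0]})
print(tmp.groupby('lid')['val'].transform('size'))    # 2, 2, 1 -- rows per group, NaN included
print(tmp.groupby('lid')['val'].transform('count'))   # 1, 1, 1 -- non-NaN values per group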
Another, slower solution using filter:
df = df.groupby('lid').filter(lambda x: len(x) <= 2)
print (df)
entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D
Timings:
#jez1
In [34]: %timeit (df[df.groupby('lid')['lid'].transform('size') <= 2000])
10 loops, best of 3: 57.8 ms per loop
#jez2
In [35]: %timeit df.groupby('lid').filter(lambda x: len(x) <= 2000)
10 loops, best of 3: 124 ms per loop
#cᴏʟᴅsᴘᴇᴇᴅ
In [36]: %timeit (df[~df.lid.groupby(df.lid).transform('count').gt(2000)])
10 loops, best of 3: 93.6 ms per loop
#pir1
In [37]: %timeit (df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2000].index)])
10 loops, best of 3: 137 ms per loop
#pir2
In [38]: %timeit (pir(df))
10 loops, best of 3: 32.9 ms per loop
Setup:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list('abcde')
df = pd.DataFrame({'lid': np.random.choice(L, N, p=(0.75, 0.0001, 0.0005, 0.0005, 0.2489)),
                   'A': np.random.randint(10000, size=N)})
df = df.sort_values(['A', 'lid']).reset_index(drop=True)
#print (df)

print (df[~df.lid.groupby(df.lid).transform('count').gt(2000)])
print (df[df.groupby('lid')['lid'].transform('size') <= 2000])
print (df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2000].index)])

def pir(df):
    f, u = pd.factorize(df.lid)
    return df[np.bincount(f)[f] <= 2000]

print (pir(df))
Caveat
These results do not account for the number of groups, which can have a big impact on the timings of some of these solutions.
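To probe that caveat, one could rerun the comparison while varying the number of distinct lid values. A rough sketch under my own assumptions (the group counts, frame size, and repetition number below are arbitrary choices, not from the answers):

import numpy as np
import pandas as pd
from timeit import timeit

def by_transform(d):
    return d[d.groupby('lid')['lid'].transform('size') <= 2]

def by_factorize(d):
    f, u = pd.factorize(d.lid)
    return d[np.bincount(f)[f] <= 2]

N = 100000
for n_groups in (5, 500, 50000):                        # few, moderate, many distinct lids
    lid = np.random.randint(n_groups, size=N).astype(str)
    d = pd.DataFrame({'lid': lid})
    print(n_groups,
          round(timeit(lambda: by_transform(d), number=10), 3),
          round(timeit(lambda: by_factorize(d), number=10), 3))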