Suppose our problem can be simplified like this:
import pandas as pd

df = pd.DataFrame()
df['C_rows'] = ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1']
df['values'] = ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344, 'customer3', 4321, 4444, 5674, 'customer4']
The table:
C_rows values
0 C1 customer1
1 C2 4321
2 C3 1266
3 C2 5671
4 C1 customer2
5 C2 123
6 C3 7344
7 C1 customer3
8 C2 4321
9 C3 4444
10 C4 5674
11 C1 customer4
How can we, in a vectorized way, find duplicated C_rows between each C1 marker? For example, row 3 is a duplicate because C2 appears in both row 1 and row 3. The dataset I'm working with has 50,000 rows, with roughly 15 rows between each C1.
E.g., the duplicate check looks like this:
C_rows values
0 C1 customer1
1 C2 4321
2 C3 1266
3 C2 5671
duplicate C2
4 C1 customer2
5 C2 123
6 C3 7344
no duplicates
7 C1 customer3
8 C2 4321
9 C3 4444
10 C4 5674
no duplicates
Without using for loops, i.e. fast (vectorized).
Answer 0 (score: 3)
Seems like apply + duplicated will do it:

df.groupby(df.C_rows.eq('C1').cumsum()).C_rows.apply(pd.Series.duplicated)

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: C_rows, dtype: bool
Then use this boolean mask to filter df.
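A self-contained sketch of this answer's approach, rebuilding the sample frame from the question (the group_keys=False flag is an addition of mine so the result keeps the original row index on recent pandas versions):

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3',
               'C1', 'C2', 'C3', 'C4', 'C1'],
    'values': ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344,
               'customer3', 4321, 4444, 5674, 'customer4'],
})

# Each 'C1' starts a new customer block; cumsum() numbers the blocks 1, 2, ...
blocks = df.C_rows.eq('C1').cumsum()

# Flag C_rows values that repeat within a block.
mask = df.groupby(blocks, group_keys=False).C_rows.apply(pd.Series.duplicated)

print(df[mask])  # only row 3, the repeated 'C2', survives the mask
```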
Answer 1 (score: 3)
For a very fast vectorized solution, create a new column from the running count of consecutive C1 markers, then check duplicated:
df['dupe'] = df.assign(dupe=df['C_rows'].eq('C1').cumsum()).duplicated(['C_rows','dupe'])
print(df)
C_rows values dupe
0 C1 customer1 False
1 C2 4321 False
2 C3 1266 False
3 C2 5671 True
4 C1 customer2 False
5 C2 123 False
6 C3 7344 False
7 C1 customer3 False
8 C2 4321 False
9 C3 4444 False
10 C4 5674 False
11 C1 customer4 False
And if you need to filter only the duplicated rows:
df = df[df.assign(dupe=df['C_rows'].eq('C1').cumsum()).duplicated(['C_rows','dupe'])]
print(df)
C_rows values
3 C2 5671
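A runnable sketch of the whole answer, assuming the question's sample frame (block is my name for the helper column; the answer reuses the name dupe for it):

```python
import pandas as pd

df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3',
               'C1', 'C2', 'C3', 'C4', 'C1'],
    'values': ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344,
               'customer3', 4321, 4444, 5674, 'customer4'],
})

# Tag each row with its block number; a single duplicated() call over the
# (C_rows, block) pair then flags repeats inside a block, with no per-group
# Python-level loop at all.
df['dupe'] = (df.assign(block=df['C_rows'].eq('C1').cumsum())
                .duplicated(['C_rows', 'block']))

# Keep only the flagged rows.
dupes = df[df['dupe']]
print(dupes)
```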
Answer 2 (score: 3)
You can use transform together with duplicated, i.e.:
df['g'] = df['values'].astype(str).str.contains('[A-z]').cumsum()
df['is_dup'] = df.groupby('g')['C_rows'].transform(lambda x : x.duplicated().any())
C_rows values g is_dup
0 C1 customer1 1 True
1 C2 4321 1 True
2 C3 1266 1 True
3 C2 5671 1 True
4 C1 customer2 2 False
5 C2 123 2 False
6 C3 7344 2 False
7 C1 customer3 3 False
8 C2 4321 3 False
9 C3 4444 3 False
10 C4 5674 3 False
11 C1 customer4 4 False
If you only want to flag the duplicated rows themselves, drop the any():
df['is_dup'] = df.groupby('g')['C_rows'].transform(lambda x : x.duplicated())
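A self-contained sketch of this variant. Note that I substitute [A-Za-z] for the original [A-z]: the latter is a raw ASCII range that also matches the characters between Z and a, such as [ and _, which happens to be harmless here but is usually unintended.

```python
import pandas as pd

df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3',
               'C1', 'C2', 'C3', 'C4', 'C1'],
    'values': ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344,
               'customer3', 4321, 4444, 5674, 'customer4'],
})

# Rows whose value contains a letter ('customer1', ...) start a new block;
# cumsum() over that boolean turns it into a block id.
df['g'] = df['values'].astype(str).str.contains('[A-Za-z]').cumsum()

# Per-row flag: True only for rows that repeat a C_rows value in their block.
df['is_dup'] = df.groupby('g')['C_rows'].transform(lambda x: x.duplicated())

print(df[df['is_dup']])  # row 3 only
```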