Finding duplicates limited to multiple ranges - pandas

Asked: 2018-01-19 10:29:59

Tags: python pandas

Suppose our problem can be simplified like this:

import pandas as pd

df = pd.DataFrame()
df['C_rows'] = ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1']
df['values'] = ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344, 'customer3', 4321, 4444, 5674, 'customer4']

Table:

    C_rows  values
0   C1      customer1
1   C2      4321
2   C3      1266
3   C2      5671
4   C1      customer2
5   C2      123
6   C3      7344
7   C1      customer3
8   C2      4321
9   C3      4444
10  C4      5674
11  C1      customer4

How can we find the duplicated C_rows within each C1 block, in a vectorized way? For example, row 3 is a duplicate because C2 already appears at row 1 within the same block. The dataset I am working with has 50,000 rows, with roughly 15 rows between each C1.

E.g. the duplicate checks look like this:

    C_rows  values
0   C1      customer1
1   C2      4321
2   C3      1266
3   C2      5671

C2 duplicated

4   C1      customer2
5   C2      123
6   C3      7344

No duplicates

7   C1      customer3
8   C2      4321
9   C3      4444
10  C4      5674

No duplicates

Without for loops - fast (vectorized).

3 answers:

Answer 0 (score: 3)

It seems groupby + duplicated will do it:

df.groupby(df.C_rows.eq('C1').cumsum()).C_rows.apply(pd.Series.duplicated)

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: C_rows, dtype: bool

Use this boolean mask to filter out the duplicated rows.
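To make the filtering step concrete, here is a minimal sketch. It uses `transform` instead of `apply` (a small variant of the answer above), since `transform` guarantees the resulting mask keeps the original index and aligns with the frame:

```python
import pandas as pd

df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3',
               'C1', 'C2', 'C3', 'C4', 'C1'],
    'values': ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344,
               'customer3', 4321, 4444, 5674, 'customer4'],
})

# The running count of 'C1' rows gives one id per customer block
block = df['C_rows'].eq('C1').cumsum()

# duplicated() within each block; transform keeps the original index
mask = df.groupby(block)['C_rows'].transform(lambda x: x.duplicated())

dupes = df[mask]    # the duplicated rows (row 3 here)
clean = df[~mask]   # the frame with duplicates dropped
```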

Answer 1 (score: 3)

For a very fast vectorized solution, create a new column from the running count of C1 values (one id per customer block), then check duplicated:

First add a boolean flag column:

df['dupe'] = df.assign(dupe=df['C_rows'].eq('C1').cumsum()).duplicated(['C_rows','dupe'])
print (df)
   C_rows     values   dupe
0      C1  customer1  False
1      C2       4321  False
2      C3       1266  False
3      C2       5671   True
4      C1  customer2  False
5      C2        123  False
6      C3       7344  False
7      C1  customer3  False
8      C2       4321  False
9      C3       4444  False
10     C4       5674  False
11     C1  customer4  False

And if you need to filter out only the duplicated rows:

df = df[df.assign(dupe=df['C_rows'].eq('C1').cumsum()).duplicated(['C_rows','dupe'])]
print (df)
  C_rows values
3     C2   5671
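A natural follow-up is knowing which customer each duplicate belongs to. One way to recover that (a sketch of my own, not part of the answer; the `customer` helper column is an assumption) is to broadcast the customer name that opens each block over the whole block:

```python
import pandas as pd

df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3',
               'C1', 'C2', 'C3', 'C4', 'C1'],
    'values': ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344,
               'customer3', 4321, 4444, 5674, 'customer4'],
})

block = df['C_rows'].eq('C1').cumsum()
dupe = df.assign(dupe=block).duplicated(['C_rows', 'dupe'])

# The customer name sits on the C1 row of each block; 'first' skips the
# NaNs left by where(), so the name is broadcast over its block
customer = df['values'].where(df['C_rows'].eq('C1')).groupby(block).transform('first')

# Customers whose block contains at least one duplicated C_row
bad_customers = customer[dupe].unique().tolist()
```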

Answer 2 (score: 3)

You can use transform and duplicated, i.e.

df['g'] = df['values'].astype(str).str.contains('[A-Za-z]').cumsum()  # a new group starts at each customer name
df['is_dup'] = df.groupby('g')['C_rows'].transform(lambda x: x.duplicated().any())

  C_rows     values  g  is_dup
0      C1  customer1  1    True
1      C2       4321  1    True
2      C3       1266  1    True
3      C2       5671  1    True
4      C1  customer2  2   False
5      C2        123  2   False
6      C3       7344  2   False
7      C1  customer3  3   False
8      C2       4321  3   False
9      C3       4444  3   False
10     C4       5674  3   False
11     C1  customer4  4   False

If you only want to flag the duplicated rows themselves, drop the any():

df['is_dup'] = df.groupby('g')['C_rows'].transform(lambda x : x.duplicated())
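Putting answer 2 together end to end, a runnable sketch (note the letter test uses `[A-Za-z]`; the group ids come out as 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4 for the sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3',
               'C1', 'C2', 'C3', 'C4', 'C1'],
    'values': ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344,
               'customer3', 4321, 4444, 5674, 'customer4'],
})

# A new group starts wherever the value contains a letter (a customer name)
df['g'] = df['values'].astype(str).str.contains('[A-Za-z]').cumsum()

# Flag only the duplicated rows themselves (no .any())
df['is_dup'] = df.groupby('g')['C_rows'].transform(lambda x: x.duplicated())

print(df.loc[df['is_dup'], ['C_rows', 'values']])
```

This relies on numeric values never containing letters, which holds for the sample data; if a real value could contain letters, the C1-based `cumsum` grouping from the other answers is safer.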