Detect duplicates in certain columns of a DataFrame and operate on them

Time: 2017-03-13 08:21:49

Tags: python pandas numpy dataframe

Continuing from this question, here is the input/output I want. I have some ideas, though I am not entirely sure about them.

How do I detect duplicates and then, among them, cross-check whether two columns have similar values?

So I have a dataframe like this.

 No  fname        sname        landline        address     time_of_move_in
 1   Alphred      Thomas         123              A        19/10/2016,00:01:00
 2   Peter        Jay            345              B        29/10/2016,00:01:00
 3   Donald       Hook           123              A        30/10/2016,00:11:00
 4   Jay          Donald         345              B        29/10/2016,00:05:00
 5   Jay          Donald         123              A        30/10/2016,00:14:00
 6   Haskell      Peter          123              B        19/10/2016,00:01:00
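
To reproduce the example, the table above can be built as a DataFrame. This is a minimal sketch: the values are copied from the table, and the variable name `df` is assumed here so that it matches the name used in the answer.

```python
import pandas as pd

# Sample data copied from the table above; `df` is an assumed variable name.
df = pd.DataFrame({
    "No": [1, 2, 3, 4, 5, 6],
    "fname": ["Alphred", "Peter", "Donald", "Jay", "Jay", "Haskell"],
    "sname": ["Thomas", "Jay", "Hook", "Donald", "Donald", "Peter"],
    "landline": [123, 345, 123, 345, 123, 123],
    "address": ["A", "B", "A", "B", "A", "B"],
    # Kept as strings here; they still need to be parsed into datetimes.
    "time_of_move_in": [
        "19/10/2016,00:01:00",
        "29/10/2016,00:01:00",
        "30/10/2016,00:11:00",
        "29/10/2016,00:05:00",
        "30/10/2016,00:14:00",
        "19/10/2016,00:01:00",
    ],
})
print(df)
```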

What I want is output like this

 Case_Number   fname    sname    landline   address   time_diff
      1        Peter     Jay       345         B       -4 Hours
      1        Jay       Donald    345         B       4 Hours
      2        Donald    Hook      123         A       -2 Hours
      2        Jay       Donald    123         A       2 Hours

Ultimately I only want to pick out the cases where the time difference between the two detected rows is < 3 hours.

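Criteria between any two detected cases

  1. The landline and address should be the same.

  2. If the above is satisfied, the same name must be repeated in fname or sname between the two detected rows. (In the first case above that name is Jay; in case 2 above it is Donald. Note that if Donald appears twice in fname, that is not a valid case.)

  3. The time difference between the two is < 3 hours. I want to bring in the directionality of time here, which is why the output set above contains negative values.

  4. Note: the time difference does not have to be shown in the format above. Any numeric/time format is fine.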
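
The criteria above can also be checked pair by pair. The following is a minimal sketch, not the method used in the answer: it assumes a self-merge on landline and address, and the names `pairs`, `cross`, `time_diff_hours` and `result` are made up for illustration. The sign convention (this row's time minus the other row's time) is an assumption chosen so that the earlier mover gets the negative value, as in the desired output.

```python
import pandas as pd

# Sample rows copied from the question's table.
rows = [
    (1, "Alphred", "Thomas", 123, "A", "19/10/2016,00:01:00"),
    (2, "Peter", "Jay", 345, "B", "29/10/2016,00:01:00"),
    (3, "Donald", "Hook", 123, "A", "30/10/2016,00:11:00"),
    (4, "Jay", "Donald", 345, "B", "29/10/2016,00:05:00"),
    (5, "Jay", "Donald", 123, "A", "30/10/2016,00:14:00"),
    (6, "Haskell", "Peter", 123, "B", "19/10/2016,00:01:00"),
]
cols = ["No", "fname", "sname", "landline", "address", "time_of_move_in"]
df = pd.DataFrame(rows, columns=cols)
df["time_of_move_in"] = pd.to_datetime(df["time_of_move_in"],
                                       format="%d/%m/%Y,%H:%M:%S")

# Criterion 1: pair each row with every other row sharing landline and address.
pairs = df.merge(df, on=["landline", "address"], suffixes=("", "_other"))
pairs = pairs[pairs["No"] != pairs["No_other"]]

# Criterion 2: one row's fname equals the other row's sname (either direction).
cross = (pairs["fname"] == pairs["sname_other"]) | (pairs["sname"] == pairs["fname_other"])
pairs = pairs[cross].copy()

# Criterion 3: signed difference in hours, keeping only |difference| < 3.
delta = pairs["time_of_move_in"] - pairs["time_of_move_in_other"]
pairs["time_diff_hours"] = delta.dt.total_seconds() / 3600
result = pairs.loc[pairs["time_diff_hours"].abs() < 3,
                   ["fname", "sname", "landline", "address", "time_diff_hours", "No"]]
print(result)
```

On the sample data this keeps rows 2 and 4 as one pair and rows 3 and 5 as the other, with each pair appearing once per direction. Case numbering is left out of the sketch.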

1 Answer:

Answer 0: (score: 1)

You can convert the timedelta to total_seconds, because working with timedelta < 0 is a bit complicated.
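
To illustrate the point about negative timedeltas, here is a small sketch using two of the timestamps from the question. The variable names are made up for the example. A negative `Timedelta` is stored as a negative number of days plus a positive remainder, which is awkward to read and compare by eye, whereas `total_seconds()` gives a single signed number.

```python
import pandas as pd

earlier = pd.Timestamp("2016-10-29 00:01:00")
later = pd.Timestamp("2016-10-29 00:05:00")

# Subtracting the later time from the earlier one gives a negative Timedelta.
delta = earlier - later
print(delta)                   # displayed as minus one day plus a remainder
print(delta.days, delta.seconds)

# A single signed number is easier to filter on.
seconds = delta.total_seconds()
print(seconds)                 # -240.0
print(seconds / 3600)          # the same difference expressed in hours
```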

import pandas as pd

# df is the dataframe shown in the question; parse the strings into datetimes
df.time_of_move_in = pd.to_datetime(df.time_of_move_in, format='%d/%m/%Y,%H:%M:%S')
print (df)
   No    fname   sname  landline address     time_of_move_in
0   1  Alphred  Thomas       123       A 2016-10-19 00:01:00
1   2    Peter     Jay       345       B 2016-10-29 00:01:00
2   3   Donald    Hook       123       A 2016-10-30 00:11:00
3   4      Jay  Donald       345       B 2016-10-29 00:05:00
4   5      Jay  Donald       123       A 2016-10-30 00:14:00
5   6  Haskell   Peter       123       B 2016-10-19 00:01:00

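def f(x):
    # threshold: 4 hours expressed in seconds
    hours4 = 4 * 60 * 60
    # keep rows whose fname occurs among the group's snames, or vice versa;
    # the parentheses matter because & binds more tightly than |
    mask = (x.fname.isin(x.sname) | x.sname.isin(x.fname)) & (len(x) > 1)
    # copy, so the assignments below do not act on a slice of x
    x1 = x[mask].copy()
    # build a label from x.name (landline, address), insert as first column
    x1.insert(0, 'Case_number', '{}{}'.format(*x.name))
    # difference to the previous row in seconds, first value is NaN
    x1['time_diff'] = x1.time_of_move_in.diff().dt.total_seconds()
    # difference to the next row, last value is NaN; use it to fill the first NaN
    x1['time_diff'] = x1['time_diff'].fillna(x1.time_of_move_in.diff(-1).dt.total_seconds())
    # boolean indexing: keep rows with -4 hours < time_diff < 4 hours
    x1 = x1[(x1['time_diff'] < hours4) & (x1['time_diff'] > -hours4)]
    return x1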


df2 = df.groupby(['landline','address']).apply(f).reset_index(drop=True)
# factorize the labels into integer codes, add 1 so numbering starts from 1
df2.Case_number = pd.factorize(df2.Case_number)[0] + 1
df2.drop(['time_of_move_in', 'No'], axis=1, inplace=True)
print (df2)
   Case_number   fname   sname  landline address  time_diff
0            1  Donald    Hook       123       A     -180.0
1            1     Jay  Donald       123       A      180.0
2            2   Peter     Jay       345       B     -240.0
3            2     Jay  Donald       345       B      240.0
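
The values in time_diff are seconds, while the question asks for hours. As a minimal sketch of how the `diff()` / `diff(-1)` combination produces a signed pair and how it could be converted, the example below uses the two timestamps of the 345 / B group; the names `forward`, `backward`, `signed`, `hours` and `keep` are made up for illustration, and the 3 hour limit is taken from the question rather than from the answer's code.

```python
import pandas as pd

times = pd.Series(pd.to_datetime(["2016-10-29 00:01:00",
                                  "2016-10-29 00:05:00"]))

# Difference to the previous row: the first entry has no predecessor (NaN).
forward = times.diff().dt.total_seconds()
# Difference to the next row: the last entry has no successor (NaN).
backward = times.diff(-1).dt.total_seconds()
# Fill the leading NaN from the backward difference to get a signed pair.
signed = forward.fillna(backward)
print(signed.tolist())   # [-240.0, 240.0]

# Convert seconds to hours and apply the question's limit of 3 hours.
hours = signed / 3600
keep = hours.abs() < 3
print(hours[keep])
```

Note that with more than two matching rows in a group, each row after the first keeps only the difference to its predecessor, so this pattern describes neighbouring rows rather than every possible pair.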