继续这个问题,这是我想要的输入/输出。我有一些想法虽然不完全确定..
How do i detect duplicates and then among them cross check if two columns have similar values?
所以我有一个像这样的数据框。
No fname sname landline address time_of_move_in
1 Alphred Thomas 123 A 19/10/2016,00:01:00
2 Peter Jay 345 B 29/10/2016,00:01:00
3 Donald Hook 123 A 30/10/2016,00:11:00
4 Jay Donald 345 B 29/10/2016,00:05:00
5 Jay Donald 123 A 30/10/2016,00:14:00
6 Haskell Peter 123 B 19/10/2016,00:01:00
我想要的是像这样的输出
Case_Number fname sname landline address time_diff
1 Peter Jay 345 B -4 Hours
1 Jay Donald 345 B 4 Hours
2 Donald Hook 123 A -2 Hours
2 Jay Donald 123 A 2 Hours
最终我只想过滤掉所发现的两者之间的时间差异为<1的情况。 3小时。
在检测到的任何两个案例之间的标准
固定电话和地址应相同
如果满足以上条件,则必须在检测到的两行之间的fname或surname中重复相同的名称。 (如果上面的情况是杰伊,在上面的案例2中是唐纳德。请注意,如果唐纳德在fname中重复两次,那么这不是一个有效的案例)
两个&lt; 3小时之间的时差,我想在这里带来时间的方向性,最终因此在上面的输出集中带来负面影响。
注意:我们不必以上述格式显示时差。只要它的某些数字/时间格式很好
答案 0 :(得分:1)
您可以将timedelta转换为total_seconds
因为使用timedelta < 0
时有点复杂:
df.time_of_move_in = pd.to_datetime(df.time_of_move_in, format='%d/%m/%Y,%H:%M:%S')
print (df)
No fname sname landline address time_of_move_in
0 1 Alphred Thomas 123 A 2016-10-19 00:01:00
1 2 Peter Jay 345 B 2016-10-29 00:01:00
2 3 Donald Hook 123 A 2016-10-30 00:11:00
3 4 Jay Donald 345 B 2016-10-29 00:05:00
4 5 Jay Donald 123 A 2016-10-30 00:14:00
5 6 Haskell Peter 123 B 2016-10-19 00:01:00
def f(x):
#convert 4 hours to seconds
hours4 = 4 * 60 * 60
mask = x.fname.isin(x.sname) | x.sname.isin(x.fname) & (len(x) > 1)
x1 = x[mask]
#create unique values from x.name, insert as first column
x1.insert(0,'Case_number', '{}{}'.format(*x.name))
#get difference of datetimes, first value is NaN
x1['time_diff'] = x1.time_of_move_in.diff().dt.total_seconds()
#get inverse difference, last value is NaN so filna NaN by value
x1['time_diff']=x1['time_diff'].fillna(x1.time_of_move_in.diff(-1).dt.total_seconds())
#boolean indexing
x1 = x1[(x1['time_diff'] < hours4) & (x1['time_diff'] > -hours4)]
return x1
df2 = df.groupby(['landline','address']).apply(f).reset_index(drop=True)
#factorize values, add 1 for start from 1
df2.Case_number = pd.factorize(df2.Case_number)[0] + 1
df2.drop(['time_of_move_in', 'No'], axis=1, inplace=True)
print (df2)
Case_number fname sname landline address time_diff
0 1 Donald Hook 123 A -180.0
1 1 Jay Donald 123 A 180.0
2 2 Peter Jay 345 B -240.0
3 2 Jay Donald 345 B 240.0