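I'm working with a dataset and need to determine the maximum date difference among multiple duplicate rows. My code below does what I want (apart from the "A value is trying to be set on a copy of a slice from a DataFrame" warning), but I'm curious how to perform the same task without creating a new DataFrame as an intermediate step. I assume that would be good practice for avoiding memory limits, but I'm having trouble working out that kind of flow. Any guidance would be much appreciated!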
import pandas as pd

df = pd.DataFrame({'Key': ['10003', '10009', '10009', '10009', '10009', '10034', '10034', '10034'],
                   'Num1': [12, 13, 13, 13, 13, 14, 14, 14],
                   'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
                   'Date1': [20120506, 20120506, 20120506, 20120506, 20120620, 20120206, 20120206, 20120405],
                   'Date2': [20120528, 20120507, 20120615, 20120629, 20120621, 20120305, 20120506, 20120506]})
df_dup = df[df.duplicated(subset=['Key', 'Num1','Num2','Date1'],keep=False)]
df = df.drop_duplicates(subset=['Key','Num1','Num2','Date1'],keep=False)
df_dup['Date2'] = pd.to_datetime(df_dup['Date2'], format='%Y%m%d')
df_dup['Date1'] = pd.to_datetime(df_dup['Date1'], format='%Y%m%d')
df_dup['DateDiff'] = (df_dup['Date2'] - df_dup['Date1']).dt.days
df_dup = df_dup.sort_values('DateDiff', ascending=False).drop_duplicates(subset=['Key','Num1','Num2','Date1'])
df = pd.concat([df,df_dup])
Steps in my code:
The final result has one row fewer than the original df.
Answer 0 (score: 2)
I believe you only need to process the rows selected by the boolean mask m:
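# Flag every row that belongs to a group of duplicates
m = df.duplicated(subset=['Key', 'Num1', 'Num2', 'Date1'], keep=False)
# Parse the dates for the flagged rows only (d1 holds Date2, d2 holds Date1)
d1 = pd.to_datetime(df.loc[m, 'Date2'], format='%Y%m%d')
d2 = pd.to_datetime(df.loc[m, 'Date1'], format='%Y%m%d')
# Index alignment leaves NaN in DateDiff for the non-duplicated rows
df['DateDiff'] = (d1 - d2).dt.days
# Within each duplicate group, mark every row except the one with the largest DateDiff
m1 = (df.loc[m, :]
        .sort_values('DateDiff', ascending=False)
        .duplicated(subset=['Key', 'Num1', 'Num2', 'Date1'])
        .reindex(df.index, fill_value=False))
df = df[~m1]
print(df)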
      Date1     Date2    Key  Num1  Num2  DateDiff
0  20120506  20120528  10003    12   121       NaN
2  20120506  20120615  10009    13   122      40.0
3  20120506  20120629  10009    13   124       NaN
4  20120620  20120621  10009    13   125       NaN
5  20120206  20120305  10034    14   126       NaN
6  20120206  20120506  10034    14   127       NaN
7  20120405  20120506  10034    14   128       NaN
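If you don't mind DateDiff being filled in for every row rather than only the duplicated ones, a shorter variant might be worth trying. This is only a sketch under that assumption, not a drop-in replacement for the approach above: the names keys, date_diff and result are my own, and the chain still allocates temporary copies internally, so it avoids the named intermediate frame and the slice assignment rather than guaranteeing lower memory use.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Key': ['10003', '10009', '10009', '10009', '10009', '10034', '10034', '10034'],
    'Num1': [12, 13, 13, 13, 13, 14, 14, 14],
    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
    'Date1': [20120506, 20120506, 20120506, 20120506, 20120620, 20120206, 20120206, 20120405],
    'Date2': [20120528, 20120507, 20120615, 20120629, 20120621, 20120305, 20120506, 20120506],
})

# Columns that define a duplicate (illustrative name)
keys = ['Key', 'Num1', 'Num2', 'Date1']

# Day difference for every row, built as a standalone Series
# so nothing is assigned into a slice of df
date_diff = (pd.to_datetime(df['Date2'], format='%Y%m%d')
             - pd.to_datetime(df['Date1'], format='%Y%m%d')).dt.days

result = (df.assign(DateDiff=date_diff)               # new frame, df itself is untouched
            .sort_values('DateDiff', ascending=False)  # largest difference first
            .drop_duplicates(subset=keys)              # keeps the first row of each group
            .sort_index())                             # restore the original row order

print(result)
```

On the sample data this keeps the same rows as the masked version (index 1 is dropped), the difference being that DateDiff holds a number for every row instead of NaN for the non-duplicated ones.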