How can I avoid creating an intermediate DataFrame?

Time: 2018-02-18 20:11:56

Tags: python pandas dataframe concat

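I'm working with a dataset in which I need to determine the largest date difference among several duplicated rows. My code below does what I want (apart from the warning "A value is trying to be set on a copy of a slice from a DataFrame"), but I'm curious how to perform the same task without creating a new DataFrame as an intermediate step. I expect that would be a good habit for avoiding memory limits, but I'm having trouble figuring out this kind of workflow. Any guidance would be very helpful!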

import pandas as pd

df = pd.DataFrame({'Key': ['10003', '10009', '10009', '10009', '10009', '10034', '10034', '10034'],
                   'Num1': [12, 13, 13, 13, 13, 14, 14, 14],
                   'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
                   'Date1': [20120506, 20120506, 20120506, 20120506, 20120620, 20120206, 20120206, 20120405],
                   'Date2': [20120528, 20120507, 20120615, 20120629, 20120621, 20120305, 20120506, 20120506]})


df_dup = df[df.duplicated(subset=['Key', 'Num1','Num2','Date1'],keep=False)]
df = df.drop_duplicates(subset=['Key','Num1','Num2','Date1'],keep=False)
df_dup['Date2'] = pd.to_datetime(df_dup['Date2'], format='%Y%m%d')
df_dup['Date1'] = pd.to_datetime(df_dup['Date1'], format='%Y%m%d')
df_dup['DateDiff'] = (df_dup['Date2'] - df_dup['Date1']).dt.days
df_dup = df_dup.sort_values('DateDiff', ascending=False).drop_duplicates(subset=['Key','Num1','Num2','Date1'])
df = pd.concat([df,df_dup])

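The steps in my code:

  • 1a. Find all duplicated rows and store them in df_dup
  • 1b. Remove the duplicated rows from the original df
  • 2. In df_dup, convert the date fields to datetime so they can be compared
  • 3. In df_dup, create a new column for the date difference
  • 4. Keep only the row with the largest 'DateDiff'
  • 5. Finally, concatenate df and df_dup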

The end result has one row fewer than the original df.
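
For reference, the steps above can be sketched as a single function that takes an explicit `.copy()` of the duplicated rows, which is one common way to avoid the "copy of a slice" warning. This is only a sketch under assumptions: the function name `keep_max_datediff` and the constant `KEYS` are made up for illustration, and the `.astype(str)` call is added defensively before parsing the integer dates. It still builds the intermediate `df_dup`, so it addresses the warning rather than the memory question.

```python
import pandas as pd

# Hypothetical helper names; only the column names come from the question.
KEYS = ['Key', 'Num1', 'Num2', 'Date1']


def keep_max_datediff(df):
    """Follow steps 1a-5, copying the duplicated rows explicitly."""
    dup_mask = df.duplicated(subset=KEYS, keep=False)
    df_dup = df[dup_mask].copy()   # 1a: explicit copy, so later assignments are safe
    rest = df[~dup_mask]           # 1b: rows that are not duplicated

    # 2: convert the integer YYYYMMDD values to datetime
    df_dup['Date1'] = pd.to_datetime(df_dup['Date1'].astype(str), format='%Y%m%d')
    df_dup['Date2'] = pd.to_datetime(df_dup['Date2'].astype(str), format='%Y%m%d')

    # 3: date difference in days
    df_dup['DateDiff'] = (df_dup['Date2'] - df_dup['Date1']).dt.days

    # 4: after sorting descending, the first row of each group has the largest DateDiff
    df_dup = (df_dup.sort_values('DateDiff', ascending=False)
                    .drop_duplicates(subset=KEYS))

    # 5: put the two parts back together
    return pd.concat([rest, df_dup])


df = pd.DataFrame({'Key': ['10003', '10009', '10009', '10009', '10009', '10034', '10034', '10034'],
                   'Num1': [12, 13, 13, 13, 13, 14, 14, 14],
                   'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
                   'Date1': [20120506, 20120506, 20120506, 20120506, 20120620, 20120206, 20120206, 20120405],
                   'Date2': [20120528, 20120507, 20120615, 20120629, 20120621, 20120305, 20120506, 20120506]})

result = keep_max_datediff(df)
print(len(df), len(result))
```

As in the original code, the date columns of the concatenated result end up with mixed types, because only the duplicated rows were converted.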

1 Answer:

Answer 0: (score: 2)

I believe you only want to process the rows filtered by the boolean mask m:

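# True for every row whose key columns occur more than once
m = df.duplicated(subset=['Key', 'Num1', 'Num2', 'Date1'], keep=False)

# Parse the dates of the duplicated rows only
d1 = pd.to_datetime(df.loc[m, 'Date2'], format='%Y%m%d')
d2 = pd.to_datetime(df.loc[m, 'Date1'], format='%Y%m%d')
# Index alignment leaves NaN in the rows outside the mask
df['DateDiff'] = (d1 - d2).dt.days
# Within the duplicated rows, flag all but the one with the largest DateDiff
m1 = (df.loc[m, :]
        .sort_values('DateDiff', ascending=False)
        .duplicated(subset=['Key', 'Num1', 'Num2', 'Date1'])
        .reindex(df.index, fill_value=False))

df = df[~m1]
print(df)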
      Date1     Date2    Key  Num1  Num2  DateDiff
0  20120506  20120528  10003    12   121       NaN
2  20120506  20120615  10009    13   122      40.0
3  20120506  20120629  10009    13   124       NaN
4  20120620  20120621  10009    13   125       NaN
5  20120206  20120305  10034    14   126       NaN
6  20120206  20120506  10034    14   127       NaN
7  20120405  20120506  10034    14   128       NaN
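
If having DateDiff filled in for every row is acceptable, the same selection can also be written as one method chain with no named intermediate frame. This is a sketch of an alternative, not the approach from the answer above: it computes the difference for all rows rather than only the masked ones, and the names `keys`, `date_diff` and `result` are chosen here for illustration. The `.astype(str)` call is a defensive assumption for parsing the integer dates.

```python
import pandas as pd

df = pd.DataFrame({'Key': ['10003', '10009', '10009', '10009', '10009', '10034', '10034', '10034'],
                   'Num1': [12, 13, 13, 13, 13, 14, 14, 14],
                   'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
                   'Date1': [20120506, 20120506, 20120506, 20120506, 20120620, 20120206, 20120206, 20120405],
                   'Date2': [20120528, 20120507, 20120615, 20120629, 20120621, 20120305, 20120506, 20120506]})

keys = ['Key', 'Num1', 'Num2', 'Date1']

# Day difference for every row, computed from temporary Series so the
# original integer date columns stay untouched.
date_diff = (pd.to_datetime(df['Date2'].astype(str), format='%Y%m%d')
             - pd.to_datetime(df['Date1'].astype(str), format='%Y%m%d')).dt.days

result = (df.assign(DateDiff=date_diff)                  # add the new column
            .sort_values('DateDiff', ascending=False)    # largest difference first
            .drop_duplicates(subset=keys)                # keep the first row per key group
            .sort_index())                               # restore the original row order

print(result)
```

Method chaining still allocates temporary objects internally, so this mainly tidies the code; whether it reduces peak memory use would have to be measured on the real data.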