Comparing dates with null values

Time: 2014-11-28 19:14:05

Tags: datetime pandas timedelta

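I have two columns. I want to check whether the difference between them is between 0 and 10 days. One of the fields often contains null values.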

df['Diff'] = (df['Dt1'] - df['Dt2'])

def wdw(x):
    if pd.notnull(x):
        if type(x) !=long:
            if type(timedelta(days=10)) != long:
                if x > timedelta(days=10):
                    return 1
    else:
        return 0

df['Diff'].apply(wdw)

When I run this, I get the following error.

TypeError: can't compare datetime.timedelta to long

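When I look at the values in df['Diff'], they all appear to be timedeltas. Any idea what is going on here? It seems like creating an indicator from the difference between two date fields should be easier than this...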

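3 answers:

Answer 0 (score: 1)

The values in df['Diff'] are numpy timedelta64s. You can compare them with pd.Timedeltas; see below.

Also, you don't need to call df['Diff'].apply(wdw), which calls wdw once for every value in the Series; you can compare the whole Series to a pd.Timedelta instead: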

import pandas as pd
import numpy as np

df = pd.DataFrame({'Dt1':pd.date_range('2010-1-1', freq='5D', periods=10),
                   'Dt2':pd.date_range('2010-1-2', freq='3D', periods=10)})
df.iloc[::3, 1] = np.nan

df['Diff'] = df['Dt1'] - df['Dt2']
print(df)

#          Dt1        Dt2    Diff
# 0 2010-01-01        NaT     NaT
# 1 2010-01-06 2010-01-05  1 days
# 2 2010-01-11 2010-01-08  3 days
# 3 2010-01-16        NaT     NaT
# 4 2010-01-21 2010-01-14  7 days
# 5 2010-01-26 2010-01-17  9 days
# 6 2010-01-31        NaT     NaT
# 7 2010-02-05 2010-01-23 13 days
# 8 2010-02-10 2010-01-26 15 days
# 9 2010-02-15        NaT     NaT

mask = (df['Diff'] < pd.Timedelta(days=10)) & (pd.Timedelta(days=0) < df['Diff'])
print(mask)

yields

0    False
1     True
2     True
3    False
4     True
5     True
6    False
7    False
8    False
9    False
Name: Diff, dtype: bool
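
Since the question asks for a 1/0 indicator rather than a boolean mask, here is a minimal sketch of one way to get there, assuming the same sample data as above. The column name InWindow is only an illustrative choice, not something from the question.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer above
df = pd.DataFrame({'Dt1': pd.date_range('2010-1-1', freq='5D', periods=10),
                   'Dt2': pd.date_range('2010-1-2', freq='3D', periods=10)})
df.iloc[::3, 1] = np.nan
df['Diff'] = df['Dt1'] - df['Dt2']

# Comparisons against NaT evaluate to False, so null rows drop out of the mask
mask = (df['Diff'] < pd.Timedelta(days=10)) & (pd.Timedelta(days=0) < df['Diff'])

# Casting the boolean mask to int gives 1 for True and 0 for False
# ('InWindow' is a hypothetical column name)
df['InWindow'] = mask.astype(int)
print(df[['Diff', 'InWindow']])
```

Rows with a missing date end up as 0 without any explicit null check, because the comparison itself is False for NaT.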

pd.Timedelta was introduced in Pandas v0.15. Here is a workaround for older versions of Pandas, using np.timedelta64s:

# Dividing by one day converts each timedelta to a float number of days
mask = ((df['Diff'] / np.timedelta64(1, 'D') < 10)
        & (df['Diff'] / np.timedelta64(1, 'D') > 0))
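
To see what the division does, the sketch below applies it to a small hand-built Series (the values and the names diff and days are illustrative assumptions). Dividing by np.timedelta64(1, 'D') should give the difference as a float number of days, with NaT turning into NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical differences, including one missing value
diff = pd.Series([pd.NaT,
                  pd.Timedelta(days=1),
                  pd.Timedelta(days=9),
                  pd.Timedelta(days=13)],
                 dtype='timedelta64[ns]')

# Timedelta / one-day unit -> float days; NaT becomes NaN
days = diff / np.timedelta64(1, 'D')
print(days)

# NaN compares as False, so the missing row is excluded from the mask
mask = (days > 0) & (days < 10)
print(mask)
```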

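Answer 1 (score: 1)

This also works, but it is not as good as the answer provided by unutbu.

def wdw(x):
    # Return 1 when x is a non-null timedelta in (0, 10] days, otherwise 0
    if pd.notnull(x) and 0 < x / np.timedelta64(1, 'D') <= 10:
        return 1
    return 0

df['Diff'].apply(wdw)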

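Answer 2 (score: 0)

Use assign to create the difference column from the Dt1 and Dt2 columns. Then use timedelta to build 0-day and 10-day values to compare the difference against, and use the resulting mask to filter the output.

from datetime import timedelta

import numpy as np
import pandas as pd

df = pd.DataFrame({'Dt1': pd.date_range('2010-1-1', freq='5D', periods=10),
                   'Dt2': pd.date_range('2010-1-2', freq='3D', periods=10)})
df.iloc[::3, 1] = np.nan  # blank out every third Dt2 value
print(df)

# Bounds of the window as plain datetime.timedelta objects
zero_days = timedelta(days=0)
ten_days = timedelta(days=10)
print(zero_days, ten_days)

# assign passes the whole DataFrame to the lambda, so this subtracts column-wise
df = df.assign(Diff=lambda d: d['Dt1'] - d['Dt2'])
mask = (df['Diff'] >= zero_days) & (df['Diff'] <= ten_days)
print(df[mask])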

Output:

         Dt1        Dt2   Diff
1 2010-01-06 2010-01-05 1 days
2 2010-01-11 2010-01-08 3 days
4 2010-01-21 2010-01-14 7 days
5 2010-01-26 2010-01-17 9 days
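
As a possible shorthand for the inclusive 0-to-10-day check, Series.between can express both bounds in one call. This is a sketch under the same sample data, not part of the original answer; the name in_window is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Dt1': pd.date_range('2010-1-1', freq='5D', periods=10),
                   'Dt2': pd.date_range('2010-1-2', freq='3D', periods=10)})
df.iloc[::3, 1] = np.nan
df['Diff'] = df['Dt1'] - df['Dt2']

# between() includes both endpoints by default; NaT rows come out as False
in_window = df['Diff'].between(pd.Timedelta(days=0), pd.Timedelta(days=10))
print(df[in_window])
```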