Subtract dates in Python pandas where the indexes match

Asked: 2017-02-14 17:05:57

Tags: python pandas dataframe

I have two DataFrames:

print (df1)
    ID      Birthday
0   A000    1990-01-01
1   A001    1991-05-05
2   A002    1970-10-01
3   A003    1980-07-07
4   A004    1945-08-15

print (df2)
    ID      Date from
0   A000    2010.01
1   A001    2012.01
2   A002    2010.01
3   A002    2010.01
4   A002    2010.11
5   A003    2009.05
6   A003    2010.01
7   A004    2010.01
8   A005    2007.11
9   A006    2017.01

df1 consists of ID and Birthday, and df2 contains ID and Date from. Some values in df2.ID do not appear in df1.ID (namely A005 and A006).

What I am trying to do

For every df2.ID that also exists in df1.ID, I want to compute the difference between df1['Birthday'] and df2['Date from'].

What I have done so far

df1['Birthday'] = pd.to_datetime(df1['Birthday'])
df2['Date from'] = pd.to_datetime(df2['Date from'])

x1 = df1.set_index(['ID'])['Birthday']
x2 = df2.set_index(['ID'])['Date from']
x3 = x2.sub(x1,fill_value=0)

print(x3)
ID
A000   -7305 days +00:00:00.000002
A001   -7794 days +00:00:00.000002
A002    -273 days +00:00:00.000002
A002    -273 days +00:00:00.000002
A002    -273 days +00:00:00.000002
A003   -3840 days +00:00:00.000002
A003   -3840 days +00:00:00.000002
A004     8905 days 00:00:00.000002
A005        0 days 00:00:00.000002
A006        0 days 00:00:00.000002
dtype: timedelta64[ns]

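The result is wrong: ID A003, for example, gets the same value on both of its rows even though those rows have different dates. I am not sure how to proceed. Thanks in advance for any help.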

2 Answers:

Answer 0 (score: 1)

First, I would merge the DataFrames so that the rows line up correctly. Then subtract the two date columns into a new column:

import pandas
from io import StringIO

data1 = StringIO("""\
ID      Birthday
A000    1990-01-01
A001    1991-05-05
A002    1970-10-01
A003    1980-07-07
A004    1945-08-15
""")

data2 = StringIO("""\
ID      Date_from
A000    2010.01
A001    2012.01
A002    2010.01
A002    2010.01
A002    2010.11
A003    2009.05
A003    2010.01
A004    2010.01
A005    2007.11
A006    2017.01
""")

# Raw strings keep the regex separator from triggering an invalid-escape warning
x1 = pandas.read_table(data1, sep=r'\s+', parse_dates=['Birthday'])
x2 = pandas.read_table(data2, sep=r'\s+', parse_dates=['Date_from'])


data = (
    x2.merge(right=x1, left_on='ID', right_on='ID', how='left')
      .assign(Date_diff=lambda df: df['Date_from'] - df['Birthday'])
)

print(data)

This gives me:

     ID  Date_from   Birthday  Date_diff
0  A000 2010-01-01 1990-01-01  7305 days
1  A001 2012-01-01 1991-05-05  7546 days
2  A002 2010-01-01 1970-10-01 14337 days
3  A002 2010-01-01 1970-10-01 14337 days
4  A002 2010-11-01 1970-10-01 14641 days
5  A003 2009-05-01 1980-07-07 10525 days
6  A003 2010-01-01 1980-07-07 10770 days
7  A004 2010-01-01 1945-08-15 23515 days
8  A005 2007-11-01        NaT        NaT
9  A006 2017-01-01        NaT        NaT
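
The trailing 00:00:00.000002 in the question's output suggests that the numeric "Date from" values were read as nanoseconds since 1970-01-01 rather than as year.month, although that is an inference from the output, not something the thread confirms. A minimal sketch of the same merge approach with an explicit date format follows. It assumes the "Date from" values are available as strings (as floats, a value such as 2010.10 would lose its trailing zero), and it swaps in how="inner" so that only IDs present in both frames are kept, which is a variation on the how='left' used above.

```python
import pandas as pd

# Sample data copied from the question; "Date from" is kept as strings (assumption).
df1 = pd.DataFrame({
    "ID": ["A000", "A001", "A002", "A003", "A004"],
    "Birthday": ["1990-01-01", "1991-05-05", "1970-10-01",
                 "1980-07-07", "1945-08-15"],
})
df2 = pd.DataFrame({
    "ID": ["A000", "A001", "A002", "A002", "A002",
           "A003", "A003", "A004", "A005", "A006"],
    "Date from": ["2010.01", "2012.01", "2010.01", "2010.01", "2010.11",
                  "2009.05", "2010.01", "2010.01", "2007.11", "2017.01"],
})

# Parse both columns with explicit formats so nothing is guessed.
df1["Birthday"] = pd.to_datetime(df1["Birthday"], format="%Y-%m-%d")
df2["Date from"] = pd.to_datetime(df2["Date from"], format="%Y.%m")

# An inner merge drops IDs that have no birthday (A005 and A006 here).
result = df2.merge(df1, on="ID", how="inner")
result["Date diff"] = result["Date from"] - result["Birthday"]

print(result)
```

With how="left" instead, A005 and A006 would stay in the result with NaT in the difference column, as in the output above.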

Answer 1 (score: 0)

Use the dateutil package to get the difference in years, months, and days:

from dateutil import relativedelta as rdelta
from datetime import date

d1 = date(2010,5,1)
d2 = date(2012,1,1)
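rd = rdelta.relativedelta(d2,d1)
# rd.years, rd.months and rd.days hold the components of the difference
print(rd.years, rd.months, rd.days)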
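
As a rough illustration of how this could be applied to the dates in the question, the sketch below wraps relativedelta in a small helper. The name age_parts is hypothetical and not part of dateutil; the two dates are taken from the A003 rows of the question's data.

```python
from datetime import date

from dateutil.relativedelta import relativedelta


def age_parts(birthday, when):
    """Return the (years, months, days) between birthday and when.

    Hypothetical helper for illustration; only relativedelta comes from dateutil.
    """
    rd = relativedelta(when, birthday)
    return rd.years, rd.months, rd.days


# Birthday and first "Date from" of ID A003 from the question.
print(age_parts(date(1980, 7, 7), date(2009, 5, 1)))  # (28, 9, 24)
```

Unlike a plain subtraction, which yields a count of days, this splits the difference into calendar years, months, and days.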