How do I find the difference between dates for each id in Python?

Time: 2017-12-06 15:24:51

Tags: python pandas

I have a pandas dataframe in a format like this:

student_id     subject_id   subject_date  
100             2000        2010-01-01
100             2001        2010-03-05
100             2002        2012-05-25
101             2000        2009-01-10
101             2001        2016-08-16
102             2000        2008-05-05
102             2003        2008-05-20
102             2004        2009-01-03
102             2005        2010-02-13

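The dataframe is already sorted by student_id and subject_date. The goal is to get, for each student_id, the difference between consecutive subject_date values. Every student_id is guaranteed to have at least 2 different subject_id values. The resulting dataframe would look something like this: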

student_id     subject_id   subject_date  diff_in_dates  
100             2000        2010-01-01    NA
100             2001        2010-03-05    30
100             2002        2012-05-25    60
101             2000        2009-01-10    NA
101             2001        2016-08-16    3000
102             2000        2008-05-05    NA
102             2003        2008-05-20    15
102             2004        2009-01-03    180
102             2005        2010-02-13    370

The diff_in_dates values here are only approximations, not the actual differences.

2 answers:

Answer 0 (score: 4)

IIUC:

In [362]: df['diff_in_dates'] = df.groupby('student_id')['subject_date'].diff().dt.days

In [363]: df
Out[363]:
   student_id  subject_id subject_date  diff_in_dates
0         100        2000   2010-01-01             NaN
1         100        2001   2010-03-05            63.0
2         100        2002   2012-05-25           812.0
3         101        2000   2009-01-10             NaN
4         101        2001   2016-08-16          2775.0
5         102        2000   2008-05-05             NaN
6         102        2003   2008-05-20            15.0
7         102        2004   2009-01-03           228.0
8         102        2005   2010-02-13           406.0
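
For anyone who wants to reproduce this from scratch, here is a minimal sketch. It assumes subject_date may have been read in as strings, so it converts the column with pd.to_datetime first (the .dt accessor only works on datetime-like values). The dataframe is rebuilt by hand from the sample in the question.

```python
import pandas as pd

# Sample data copied from the question; dates start out as plain strings.
df = pd.DataFrame({
    "student_id": [100, 100, 100, 101, 101, 102, 102, 102, 102],
    "subject_id": [2000, 2001, 2002, 2000, 2001, 2000, 2003, 2004, 2005],
    "subject_date": [
        "2010-01-01", "2010-03-05", "2012-05-25",
        "2009-01-10", "2016-08-16",
        "2008-05-05", "2008-05-20", "2009-01-03", "2010-02-13",
    ],
})

# Assumption: the column is not yet datetime64, so convert it explicitly.
df["subject_date"] = pd.to_datetime(df["subject_date"])

# Difference to the previous row within each student, as a number of days.
# The first row of every group has no predecessor and becomes NaN.
df["diff_in_dates"] = df.groupby("student_id")["subject_date"].diff().dt.days

print(df)
```

The result is a float column because NaN cannot be stored in a plain integer column.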

Answer 1 (score: 3)

This is simple in practice! Take a look at diff():

df1['diff_in_date'] = df1.groupby('student_id')['subject_date'].diff()

   student_id  subject_id subject_date diff_in_date
0         100        2000   2010-01-01          NaT
1         100        2001   2010-03-05      63 days
2         100        2002   2012-05-25     812 days
3         101        2000   2009-01-10          NaT
4         101        2001   2016-08-16    2775 days
5         102        2000   2008-05-05          NaT
6         102        2003   2008-05-20      15 days
7         102        2004   2009-01-03     228 days
8         102        2005   2010-02-13     406 days

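diff() simply takes the difference between the current record and the previous record in the column (here, within each student_id group). It's a good thing your data is already sorted correctly!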
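
To make the "current minus previous" idea concrete, here is a small sketch with deliberately unsorted rows (the row order is my own assumption for illustration, not from the question). It sorts first, then checks that the grouped diff() matches subtracting a grouped shift() by hand.

```python
import pandas as pd

# A few rows from the question, shuffled on purpose.
df1 = pd.DataFrame({
    "student_id": [100, 101, 100, 101],
    "subject_id": [2001, 2001, 2000, 2000],
    "subject_date": pd.to_datetime(
        ["2010-03-05", "2016-08-16", "2010-01-01", "2009-01-10"]
    ),
})

# diff() works on row order, so sort by student and date before using it.
df1 = df1.sort_values(["student_id", "subject_date"]).reset_index(drop=True)

grouped = df1.groupby("student_id")["subject_date"]

# Built-in version: current row minus previous row within each group.
builtin = grouped.diff()

# Manual version: shift each group's dates down by one row and subtract.
manual = df1["subject_date"] - grouped.shift()

df1["diff_in_date"] = builtin
print(df1)
```

Without the sort_values step, diff() would still run, but the differences would follow the original row order and could come out negative.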

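For future reference, diff() also accepts a periods argument, which takes the difference against a row several positions back instead of the immediately preceding one. See the sample output below: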

df1['diff_in_date'] = df1.groupby('student_id')['subject_date'].diff(2)

# output
   student_id  subject_id subject_date diff_in_date
0         100        2000   2010-01-01          NaT
1         100        2001   2010-03-05          NaT
2         100        2002   2012-05-25     875 days
3         101        2000   2009-01-10          NaT
4         101        2001   2016-08-16          NaT
5         102        2000   2008-05-05          NaT
6         102        2003   2008-05-20          NaT
7         102        2004   2009-01-03     243 days
8         102        2005   2010-02-13     634 days
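
If you want the two-row difference as plain numbers rather than Timedelta values, a sketch along the same lines is below. It uses only students 100 and 102 from the question and passes periods=2 by keyword, combined with .dt.days as in the other answer.

```python
import pandas as pd

# Students 100 and 102 from the question, already sorted.
df1 = pd.DataFrame({
    "student_id": [100, 100, 100, 102, 102, 102, 102],
    "subject_date": pd.to_datetime([
        "2010-01-01", "2010-03-05", "2012-05-25",
        "2008-05-05", "2008-05-20", "2009-01-03", "2010-02-13",
    ]),
})

# Difference to the row two positions earlier within each student.
# The first two rows of every group have nothing to compare against.
two_step = df1.groupby("student_id")["subject_date"].diff(periods=2).dt.days

df1["diff_2_rows"] = two_step
print(df1)
```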