I have a pandas DataFrame in the following format:
student_id subject_id subject_date
100 2000 2010-01-01
100 2001 2010-03-05
100 2002 2012-05-25
101 2000 2009-01-10
101 2001 2016-08-16
102 2000 2008-05-05
102 2003 2008-05-20
102 2004 2009-01-03
102 2005 2010-02-13
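The DataFrame is already sorted by student_id and subject_date.
The goal is to get, for each student_id, the difference in days between consecutive subject_date values. Each student_id is guaranteed to have at least 2 distinct subject_id values. The resulting DataFrame should look like this: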
student_id subject_id subject_date diff_in_dates
100 2000 2010-01-01 NA
100 2001 2010-03-05 30
100 2002 2012-05-25 60
101 2000 2009-01-10 NA
101 2001 2016-08-16 3000
102 2000 2008-05-05 NA
102 2003 2008-05-20 15
102 2004 2009-01-03 180
102 2005 2010-02-13 370
The diff_in_dates values shown here are only approximations, not the actual differences.
Answer 0 (score: 4)
IIUC:
In [362]: df['diff_in_dates'] = df.groupby('student_id')['subject_date'].diff().dt.days
In [363]: df
Out[363]:
student_id subject_id subject_date diff_in_dates
0 100 2000 2010-01-01 NaN
1 100 2001 2010-03-05 63.0
2 100 2002 2012-05-25 812.0
3 101 2000 2009-01-10 NaN
4 101 2001 2016-08-16 2775.0
5 102 2000 2008-05-05 NaN
6 102 2003 2008-05-20 15.0
7 102 2004 2009-01-03 228.0
8 102 2005 2010-02-13 406.0
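For anyone reproducing this outside an interactive session, here is a minimal self-contained sketch. It assumes subject_date may arrive as strings (the question does not state the dtype), so it converts the column with pd.to_datetime first; the .dt accessor only works on datetime-like values.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "student_id": [100, 100, 100, 101, 101, 102, 102, 102, 102],
    "subject_id": [2000, 2001, 2002, 2000, 2001, 2000, 2003, 2004, 2005],
    "subject_date": [
        "2010-01-01", "2010-03-05", "2012-05-25",
        "2009-01-10", "2016-08-16",
        "2008-05-05", "2008-05-20", "2009-01-03", "2010-02-13",
    ],
})

# Assumption: the dates are strings, so parse them before taking differences.
df["subject_date"] = pd.to_datetime(df["subject_date"])

# Difference to the previous row within each student, expressed in whole days.
# The first row of every group has no predecessor and becomes NaN.
df["diff_in_dates"] = df.groupby("student_id")["subject_date"].diff().dt.days

print(df)
```

The result column is float rather than int because the NaN entries force a floating-point dtype.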
Answer 1 (score: 3)
This is straightforward in practice. Take a look at diff():
df1['diff_in_date'] = df1.groupby('student_id')['subject_date'].diff()
student_id subject_id subject_date diff_in_date
0 100 2000 2010-01-01 NaT
1 100 2001 2010-03-05 63 days
2 100 2002 2012-05-25 812 days
3 101 2000 2009-01-10 NaT
4 101 2001 2016-08-16 2775 days
5 102 2000 2008-05-05 NaT
6 102 2003 2008-05-20 15 days
7 102 2004 2009-01-03 228 days
8 102 2005 2010-02-13 406 days
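diff() simply takes the difference between the current record and the previous record in the column, within each group. Good thing your data is already sorted correctly!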
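To make the "current minus previous" description concrete, the sketch below compares the grouped diff() with an explicit subtraction of the grouped shift(). The data is a subset of the question's rows, and the variable names via_diff and via_shift are illustrative only.

```python
import pandas as pd

# A subset of the question's rows, already sorted by student and date.
df1 = pd.DataFrame({
    "student_id": [100, 100, 100, 101, 101],
    "subject_date": pd.to_datetime(
        ["2010-01-01", "2010-03-05", "2012-05-25", "2009-01-10", "2016-08-16"]
    ),
})

grouped = df1.groupby("student_id")["subject_date"]

# Grouped diff: each row minus the previous row of the same student.
via_diff = grouped.diff()

# The same thing spelled out: shift each group down by one row, then subtract.
via_shift = df1["subject_date"] - grouped.shift(1)

# Compare as day counts; NaT positions are mapped to -1 so they compare equal.
same = (via_diff.dt.days.fillna(-1) == via_shift.dt.days.fillna(-1)).all()
print(same)
```

Because the result depends on row order, unsorted data would need something like df1.sort_values(['student_id', 'subject_date']) before the groupby.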
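For future reference, diff() also accepts a periods argument, which takes the difference against a row several positions back instead of the immediately preceding one. See the sample output below: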
df1['diff_in_date'] = df1.groupby('student_id')['subject_date'].diff(2)
# output
student_id subject_id subject_date diff_in_date
0 100 2000 2010-01-01 NaT
1 100 2001 2010-03-05 NaT
2 100 2002 2012-05-25 875 days
3 101 2000 2009-01-10 NaT
4 101 2001 2016-08-16 NaT
5 102 2000 2008-05-05 NaT
6 102 2003 2008-05-20 NaT
7 102 2004 2009-01-03 243 days
8 102 2005 2010-02-13 634 days
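As a small runnable sketch of the periods argument, the following uses only the rows for student 102 from the question and places the one-row and two-row differences side by side. The column names one_back and two_back are made up for this illustration.

```python
import pandas as pd

# Rows for student 102 from the question.
df1 = pd.DataFrame({
    "student_id": [102, 102, 102, 102],
    "subject_date": pd.to_datetime(
        ["2008-05-05", "2008-05-20", "2009-01-03", "2010-02-13"]
    ),
})

grouped = df1.groupby("student_id")["subject_date"]

# periods=1 (the default) compares with the previous row;
# periods=2 compares with the row two positions earlier in the same group.
one_back = grouped.diff(periods=1).dt.days
two_back = grouped.diff(periods=2).dt.days

print(pd.DataFrame({"one_back": one_back, "two_back": two_back}))
```

With periods=2 the first two rows of each group have nothing to compare against, so they come out as missing values.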