Question

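I have a dataframe at the assignment-ID level that includes the submission date and the student ID. I want to find the number of assignments each student submitted in the last 12 months, excluding the most recent entry. The assignment ID is the unique key. I want the cumulative count computed at the assignment-ID level.

I tried using groupby for this step but could not get the desired output. I would like an answer in Python.

What I have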

Assmt id    student id  date of submission
106473754   100357          2/1/2016
102485554   100357          3/1/2016
108474032   100357          4/1/2016
101663805   100357          2/1/2017
307953885   100364          5/1/2017
307252429   100364          7/1/2017
304205214   100364          11/1/2017
304041247   100364          11/1/2017
512459298   100364          2/1/2018

What I want

student id  date of submission  count_in_12_mon
100357            2/1/2017                       3
100364            2/1/2018                       4

Answer 1

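You can use transform with max to find each group's latest date, convert the datetimes to month counts, compare them against every date of submission, assign the result back, and then aggregate with agg. The code assumes the columns are named studentid and dateofsubmission and that dateofsubmission is already a datetime.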

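# Latest submission date per student, broadcast to every row
s = df.groupby('studentid')['dateofsubmission'].transform('max')
# Whole-month difference between each submission and the student's latest one
s1 = (s.dt.year * 12 + s.dt.month) - (df.dateofsubmission.dt.year * 12 + df.dateofsubmission.dt.month)
# Flag submissions 1 to 12 months before the latest (the latest itself has a difference of 0)
df['New'] = (s1 > 0) & (s1 <= 12)
yourdf = df.groupby('studentid').agg({'New': 'sum', 'dateofsubmission': 'last'}).reset_index()
yourdf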
Out[851]: 
   studentid dateofsubmission  New
0     100357       2017-02-01  3.0
1     100364       2018-02-01  4.0
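
For reference, here is a self-contained sketch of the approach above applied to the sample data from the question. The names assmt_id, months_back, and in_window are chosen here for illustration, and the dates are assumed to be month-first, consistent with the output above. Note that the month-difference test excludes every submission in the same calendar month as the latest one, not only the latest entry itself.

```python
import pandas as pd

# Sample data from the question; dates are assumed to be month/day/year.
df = pd.DataFrame({
    "assmt_id": [106473754, 102485554, 108474032, 101663805,
                 307953885, 307252429, 304205214, 304041247, 512459298],
    "studentid": [100357] * 4 + [100364] * 5,
    "dateofsubmission": pd.to_datetime(
        ["2/1/2016", "3/1/2016", "4/1/2016", "2/1/2017",
         "5/1/2017", "7/1/2017", "11/1/2017", "11/1/2017", "2/1/2018"],
        format="%m/%d/%Y",
    ),
})

# Latest submission date per student, repeated on every row of that student.
latest = df.groupby("studentid")["dateofsubmission"].transform("max")

# Whole-month distance from each submission to the student's latest one.
months_back = (latest.dt.year * 12 + latest.dt.month) - (
    df["dateofsubmission"].dt.year * 12 + df["dateofsubmission"].dt.month
)

# True for submissions 1 to 12 months before the latest one.
df["in_window"] = (months_back > 0) & (months_back <= 12)

# One row per student: number of flagged submissions and the latest date.
result = (
    df.sort_values("dateofsubmission")
      .groupby("studentid")
      .agg(count_in_12_mon=("in_window", "sum"),
           dateofsubmission=("dateofsubmission", "last"))
      .reset_index()
)
print(result)
```

On the sample data this gives counts of 3 and 4 for students 100357 and 100364.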

Answer 2

Try the following code:

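import pandas as pd
df['date of submission'] = pd.to_datetime(df['date of submission'])
df2 = df.groupby('student id', as_index=False)['date of submission'].last()
# Temporarily store the year of each student's first submission
df2['count_in_12_mon'] = df.groupby('student id')['date of submission'].first().dt.year.tolist()
# Count the student's submissions in that year (a calendar-year match, not a rolling 12-month window)
df2['count_in_12_mon'] = df2.apply(lambda x: ((df['date of submission'].dt.year == x['count_in_12_mon']) & (df['student id'] == x['student id'])).sum(), axis=1)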

Then print the result:

print(df2)
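
The question also asks for the count at the assignment-ID level. As a minimal sketch, assuming the column names from the question's table and month-first dates, the following counts, for each row, the same student's earlier submissions that fall within the preceding 12 months. The helper prior_12_month_counts is a hypothetical name introduced here, and treating same-day submissions in their data order is an assumption.

```python
import pandas as pd

# Sample data from the question; dates are assumed to be month/day/year.
df = pd.DataFrame({
    "Assmt id": [106473754, 102485554, 108474032, 101663805,
                 307953885, 307252429, 304205214, 304041247, 512459298],
    "student id": [100357] * 4 + [100364] * 5,
    "date of submission": pd.to_datetime(
        ["2/1/2016", "3/1/2016", "4/1/2016", "2/1/2017",
         "5/1/2017", "7/1/2017", "11/1/2017", "11/1/2017", "2/1/2018"],
        format="%m/%d/%Y",
    ),
})


def prior_12_month_counts(dates: pd.Series) -> pd.Series:
    """Hypothetical helper: for each date, count the earlier entries of the
    same group that fall within the 12 months before it (inclusive)."""
    ordered = dates.sort_values(kind="stable")  # ties keep their data order
    counts = []
    for pos in range(len(ordered)):
        current = ordered.iloc[pos]
        window_start = current - pd.DateOffset(months=12)
        earlier = ordered.iloc[:pos]  # excludes the current entry itself
        counts.append(int((earlier >= window_start).sum()))
    # Return the counts in the group's original row order.
    return pd.Series(counts, index=ordered.index).reindex(dates.index)


# One count per assignment row, computed separately for each student.
df["count_in_12_mon"] = (
    df.groupby("student id")["date of submission"]
      .transform(prior_12_month_counts)
)
print(df)
```

On the sample data, the last row for each student carries 3 and 4, matching the desired output in the question.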

Performing a groupby to find the cumulative count of assignment IDs within a date range
