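I have a dataset about users taking online courses. It has columns such as 'id', 'event', and 'time'. I grouped the rows and want to know how often a user performs each event on a given day, so I'd like the counts broken down by day.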
lt = log_train.groupby(['enrollment_id','event','time']).size()
print(lt)
enrollment_id event time
1 access 2014-06-14T09:38:39 2
2014-06-14T09:38:48 1
2014-06-19T06:21:16 2
2014-06-19T06:21:32 1
2014-06-19T06:21:45 1
..
200887 navigate 2014-07-24T03:27:16 1
page_close 2014-07-24T04:19:55 1
video 2014-07-24T04:19:57 1
200888 access 2014-07-24T03:48:14 2
discussion 2014-07-24T03:47:57 1
navigate 2014-07-24T03:47:17 1
2014-07-24T03:47:28 1
2014-07-24T03:48:01 1
From another dataset, I also have the user ID, the course ID, and each course's date range.
usercourse = pd.merge(enroll,date,how="left", on= 'course_id' )
enrollment_id username \
0 1 9Uee7oEuuMmgPx2IzPfFkWgkHZyPbWr0
1 3 1qXC7Fjbwp66GPQc6pHLfEuO8WKozxG4
2 4 FIHlppZyoq8muPbdVxS44gfvceX9zvU7
course_id from to
0 DPnLzkJJqOOPRJfBxIHbQEERiYHu5ila 2014-06-12 2014-07-11
1 7GRhBDsirIGkRZBtSMEzNTyDr2JQm4xx 2014-06-19 2014-07-18
2 DPnLzkJJqOOPRJfBxIHbQEERiYHu5ila 2014-06-12 2014-07-11
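Each user is enrolled in only one course, and every course runs for the same length of time, 30 days. So what I'd like to end up with is something like this: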
enrollment_id event #ofDays #ofActionTimes
1 access 2 2
10 6
30 2
..
200887 navigate 23 1
page_close 30 1
video 1 1
200888 access 12 2
discussion 2 1
navigate 5 3
29 4
**#ofDays means at the Nth day of a course.
#ofActionTimes means how often an event happens on the Nth day.**
Since each course starts on a different date, I don't know how to build this table in Python. I hope someone can help me solve this!
Answer 0 (score: 0)
IIUC, you can use merge, groupby, and count to get what you want.
First, some sample data. It is based on the data you provided, but I modified it so that the output can be traced clearly back to the starting data.
import pandas as pd
data1 = {"enrollment_id":[1,1,1,1,2,2,3,3,3],
"event":["access","access","access","navigate","access",
"page_close","navigate","navigate","video"],
"time":["2014-06-14T09:38:39", "2014-06-14T09:38:48",
"2014-06-19T06:21:16", "2014-06-19T06:21:32",
"2014-06-21T06:21:45", "2014-06-22T06:21:16",
"2014-06-19T06:21:32", "2014-06-20T06:21:16",
"2014-06-20T06:21:16"]}
data2 = {"enrollment_id":[1,2,3],
"username":["user1", "user2", "user3"],
"course_id":["course1", "course2", "course3"],
"course_from":["2014-06-12", "2014-06-19", "2014-06-12"],
"course_to":["2014-07-11", "2014-07-18", "2014-07-11"]}
df1 = pd.DataFrame(data1)
df1
enrollment_id event time
0 1 access 2014-06-14T09:38:39
1 1 access 2014-06-14T09:38:48
2 1 access 2014-06-19T06:21:16
3 1 navigate 2014-06-19T06:21:32
4 2 access 2014-06-21T06:21:45
5 2 page_close 2014-06-22T06:21:16
6 3 navigate 2014-06-19T06:21:32
7 3 navigate 2014-06-20T06:21:16
8 3 video 2014-06-20T06:21:16
df2 = pd.DataFrame(data2)
df2
course_id enrollment_id course_from course_to username
0 course1 1 2014-06-12 2014-07-11 user1
1 course2 2 2014-06-19 2014-07-18 user2
2 course3 3 2014-06-12 2014-07-11 user3
We want to know how many times a particular event occurred for a particular enrollment_id, with a separate count for each day of the course.
To get the course day number course_day_num, subtract course_from (the course start date) from event_date (the calendar date of the event).
df = (df1.merge(df2[["enrollment_id", "course_from"]],
on="enrollment_id", how="left")
)
df["event_date"] = pd.to_datetime(df["time"]).dt.normalize()
df["course_from"] = pd.to_datetime(df["course_from"])
df["course_day_num"] = (df.event_date - df["course_from"]).dt.days
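One detail worth spelling out: the subtraction above is zero-based, so an event on the course start date gets course_day_num 0. If "the Nth day of a course" in the question is meant to be one-based, adding 1 would shift it. Below is a minimal sketch of that idea; the small `events` frame and the `course_day_1based` column are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical mini dataset: event timestamps already joined with course start dates.
events = pd.DataFrame({
    "enrollment_id": [1, 1, 2],
    "time": ["2014-06-12T08:00:00", "2014-06-14T09:38:39", "2014-06-21T06:21:45"],
    "course_from": ["2014-06-12", "2014-06-12", "2014-06-19"],
})

# Drop the time-of-day part so that only whole calendar days are compared.
event_date = pd.to_datetime(events["time"]).dt.normalize()
course_from = pd.to_datetime(events["course_from"])

# Zero-based offset: an event on the start date itself is day 0.
events["course_day_num"] = (event_date - course_from).dt.days

# One-based variant (assumption: "Nth day" counts the start date as day 1).
events["course_day_1based"] = events["course_day_num"] + 1

print(events[["enrollment_id", "course_day_num", "course_day_1based"]])
```

Which convention is right depends on how the 30-day course window is meant to be numbered, so treat the +1 as optional.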
Then groupby the enrollment, event, and course_day_num to get the number of events per person per course day:
groupby_cols = ["enrollment_id", "event", "event_date", "course_day_num"]
df.groupby(groupby_cols).event_date.count()
enrollment_id event event_date course_day_num
1 access 2014-06-14 2 2
2014-06-19 7 1
navigate 2014-06-19 7 1
2 access 2014-06-21 2 1
page_close 2014-06-22 3 1
3 navigate 2014-06-19 7 1
2014-06-20 8 1
video 2014-06-20 8 1
Name: event_date, dtype: int64
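If a flat table closer to the layout in the question is preferred, the grouped counts could be turned back into ordinary columns. The following is a sketch under that assumption, reusing the sample data from this answer; the column name `n_actions` is a made-up stand-in for #ofActionTimes.

```python
import pandas as pd

# Sample data from the answer above.
df1 = pd.DataFrame({
    "enrollment_id": [1, 1, 1, 1, 2, 2, 3, 3, 3],
    "event": ["access", "access", "access", "navigate", "access",
              "page_close", "navigate", "navigate", "video"],
    "time": ["2014-06-14T09:38:39", "2014-06-14T09:38:48",
             "2014-06-19T06:21:16", "2014-06-19T06:21:32",
             "2014-06-21T06:21:45", "2014-06-22T06:21:16",
             "2014-06-19T06:21:32", "2014-06-20T06:21:16",
             "2014-06-20T06:21:16"],
})
df2 = pd.DataFrame({
    "enrollment_id": [1, 2, 3],
    "course_from": ["2014-06-12", "2014-06-19", "2014-06-12"],
})

# Attach each enrollment's course start date to its events.
df = df1.merge(df2, on="enrollment_id", how="left")

# Whole-day offset of each event from the course start (zero-based).
event_date = pd.to_datetime(df["time"]).dt.normalize()
df["course_day_num"] = (event_date - pd.to_datetime(df["course_from"])).dt.days

# Count rows per (enrollment, event, course day) and flatten into columns.
summary = (
    df.groupby(["enrollment_id", "event", "course_day_num"])
      .size()
      .reset_index(name="n_actions")  # hypothetical name for #ofActionTimes
)
print(summary)
```

Here `size()` counts rows per group, which matches `count()` on this data because there are no missing values, and `reset_index` gives one row per enrollment, event, and course day.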