I'm using Python and pandas to flatten our VLE (Blackboard Inc.) activity tables. I'm trying to total the time spent each day accessing courses, as opposed to the other, non-course activity in the activity log/table.
I've created some fake data and code (Python) below to simulate the problem and where I'm struggling. It's the flattened_v2 part I'm stuck on, as that's closest to my real situation.
The log data typically looks like this; I create it in the code example below (the activity dataframe):
DAY event somethingelse timespent logtime
0 2013-01-02 null foo 0.274139 2013-01-02 00:00:00
0 2013-01-02 course1 foo 1.791061 2013-01-02 01:00:00
1 2013-01-02 course1 foo 0.824152 2013-01-02 02:00:00
2 2013-01-02 course1 foo 1.626477 2013-01-02 03:00:00
In my real data I have a field called logtime. This is an actual datetime rather than a time-spent field (it's also included in the fake data I'm experimenting with).
How do I total the time spent on event = course (many courses), using logtime?
Each record contains a logtime showing the datetime the page was accessed. The next record's logtime shows the datetime a new page was accessed, and therefore when the old page was left (close enough). How do I get the total time where event is not null? If I just use max/min, that overestimates, because the gaps between course accesses (where event = null) are included too. I've simplified the data so that each record increments by 1 hour; that's not the real situation.
Thanks for any tips, Jason
The code is:
# dataframe example
# How do I record total time spent on event = course (many courses)?
# Each record contains logtime which shows datetime to access page
# Next record logtime shows the datetime accessing new page and
# therefore leaving old page (close enough)
#
#
import pandas as pd
import numpy as np
import datetime
# Creating fake data with string null and course1, course2
df = pd.DataFrame({
    'DAY': pd.Timestamp('20130102'),
    'timespent': abs(np.random.randn(5)),
    'event': "course1",
    'somethingelse': 'foo'})
df2 = pd.DataFrame({
    'DAY': pd.Timestamp('20130102'),
    'timespent': abs(np.random.randn(5)),
    'event': "course2",
    'somethingelse': 'foo'})
dfN = pd.DataFrame({
    'DAY': pd.Timestamp('20130102'),
    'timespent': abs(np.random.randn(1)),
    'event': "null",
    'somethingelse': 'foo'})
dfLog = [dfN, df, df2, dfN, dfN, dfN, df2, dfN, dfN, df, dfN, df2, dfN, df, df2, dfN]
activity = pd.concat(dfLog)
# add time column
times = pd.date_range('20130102', periods=activity.shape[0], freq='H')
activity['logtime'] = times
# activity contains a DAY field (probably not required)
# timespent -this is fake time spent on each event. This is
# not in my real data but I started this way when faking data
# event -either a course or null (not a course)
# somethingelse -just there to indicate other data.
#
print(activity)  # This is quite close to real data.
# Fake activity date created above to demo question.
# *********************************************
# Actual code to extract time spent on courses
# *********************************************
# Helper to aggregate data (max and min) where the time
# spent is already an explicit duration.
def agg_timespent(a, b):
    c = abs(b - a)
    return c
# Where the time difference is not explicit but must be derived
# from the logtime recorded when each page (course event) was accessed.
def agg_logtime(a, b):
    # In real data b and a are strings:
    # b = datetime.datetime.strptime(b, '%Y-%m-%d %H:%M:%S')
    # a = datetime.datetime.strptime(a, '%Y-%m-%d %H:%M:%S')
    c = abs(b - a).total_seconds()  # .seconds alone would drop whole days
    return c
# Remove 'null' data as that's not of interest here.
# null means non course activity e.g. checking email
# or timetable -non course stuff.
activity = activity[activity.event != 'null']
print(activity)  # This shows *just* course activity info
# pivot by Day (only 1 day in fake data but 1 year in real data)
# Don't need DAY field but helped me fake-up data
flattened_v1 = activity.pivot_table(index=['DAY'], values=["timespent"], aggfunc=[min, max], fill_value=0)
flattened_v1['time_diff'] = flattened_v1.apply(lambda row: agg_timespent(row.iloc[0], row.iloc[1]), axis=1)
# How to achieve this?
# Where NULL has been removed I think this is wrong as NULL records could
# indicate several hours gap between course accesses but as
# I'm using MAX and MIN then I'm ignoring the periods of null
# This is overestimating time on courses
# I need to subtract/remove/ignore?? the hours spent on null times
flattened_v2 = activity.pivot_table(index=['DAY'], values=["logtime"], aggfunc=[min, max], fill_value=0)
flattened_v2['time_diff'] = flattened_v2.apply(lambda row: agg_logtime(row.iloc[0], row.iloc[1]), axis=1)
print()
print('*****Wrong!**********')
print('This is not what I have but just showing how I thought it might work.')
print(flattened_v1)
print()
print('******Not sure how to do this*********')
print('This is wrong as nulls/gaps are also included too')
print(flattened_v2)
Answer (score: 1)

You're right (in your comment): you need dataframe.shift.

If I understand your question correctly, you want to record the time that has elapsed since the previous timestamp, so each timestamp marks the start of an activity, and when the previous activity was null we should not record any elapsed time. Assuming that's all correct, use shift to add a column of time differences:

    activity['timelog_diff'] = activity['logtime'] - activity['logtime'].shift()

The first row will now hold the special "not-a-time" value NaT, but that's fine because we can't compute an elapsed time there. Next, we can fill in NaT for any elapsed time that immediately follows a null event:

    mask = activity.event == 'null'
    activity.loc[mask.shift(1).fillna(False), 'timelog_diff'] = pd.NaT

When we want to know how much time was spent on course1, we have to shift again, because selecting the course1 rows gives the rows where course1 *started*; we need the rows where course1 *finished*:

    activity[(activity.event == 'course1').shift().fillna(False)]['timelog_diff'].sum()

For your example data this returns a total of 15 hours for course1, and the same selection works for the other courses.
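Pulling the answer's steps together, here is a minimal self-contained sketch on a tiny hypothetical log (five hourly rows; the names and values are illustrative, not the asker's real data). It uses `shift(fill_value=False)` rather than `.fillna(False)` so the masks stay boolean on newer pandas:

```python
import pandas as pd

# Tiny illustrative log: hourly page hits, some course, some null.
log = pd.DataFrame({
    'event':   ['null', 'course1', 'course1', 'null', 'course1'],
    'logtime': pd.date_range('2013-01-02', periods=5, freq='h'),
})

# Elapsed time credited to each row = gap since the previous timestamp.
log['timelog_diff'] = log['logtime'] - log['logtime'].shift()

# Blank out gaps that were actually spent on a null event,
# i.e. rows whose *previous* event was null.
was_null = (log['event'] == 'null').shift(1, fill_value=False)
log.loc[was_null, 'timelog_diff'] = pd.NaT

# Time on course1 = the diffs of the rows that *follow* course1 rows.
after_course1 = (log['event'] == 'course1').shift(1, fill_value=False)
total = log.loc[after_course1, 'timelog_diff'].sum()
print(total)  # → 0 days 02:00:00
```

Note that the final course1 hit has no following row, so no time is credited for it; that matches the "close enough" semantics in the question, where a page's time ends only when the next page is accessed.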