Python: using pandas.pivot_table to flatten an activity log and show time spent on activities

Date: 2016-06-22 20:40:54

Tags: python pandas lambda pivot-table

I am using Python and pandas to flatten our VLE (Blackboard Inc.) activity table. I'm trying to total the time spent per day accessing courses, as opposed to the other, non-course activity in the activity log/table.

I've created some fake data and code (Python) below to simulate the problem and show where I'm stuck. It's the flattened_v2 part I'm struggling with, as that is closest to my real situation.

The log data typically looks like this (it's the activity dataframe created in the code below):

         DAY    event somethingelse  timespent             logtime
0 2013-01-02     null           foo   0.274139 2013-01-02 00:00:00
0 2013-01-02  course1           foo   1.791061 2013-01-02 01:00:00
1 2013-01-02  course1           foo   0.824152 2013-01-02 02:00:00
2 2013-01-02  course1           foo   1.626477 2013-01-02 03:00:00

In my real data I have a field called logtime. It is an actual datetime, not a time-spent field (it's also included in the fake data I'm experimenting with).

How do I total the time spent (using logtime) on events where event = course (there are many courses)?

Each record's logtime shows the datetime the page was accessed. The next record's logtime shows the datetime the next page was accessed, and therefore when the previous page was left (close enough). How do I get the total time where the event is not null? If I just use max/min, I overestimate, because gaps between course accesses (where event = null) get included too. I've simplified the data so that each record adds one hour, which is not the real situation.
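To make the overestimation concrete, here is a minimal sketch (hypothetical timestamps, same column names as the fake data below) showing how max - min on the null-filtered rows silently counts the null gap as course time:

```python
import pandas as pd

# Hypothetical one-day log: hourly page views with a non-course
# ('null') visit in the middle.
log = pd.DataFrame({
    'event': ['course1', 'course1', 'null', 'null', 'course1'],
    'logtime': pd.date_range('2013-01-02', periods=5, freq='h'),
})

# Dropping nulls and taking max - min counts the 2-hour null gap
# as course time.
course = log[log['event'] != 'null']
overestimate = course['logtime'].max() - course['logtime'].min()
print(overestimate)  # 0 days 04:00:00, though only 2 hours were on the course
```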

Thanks for any tips, Jason

The code is:

# dataframe example
# How do I record total time spent on event = course (many courses)?
# Each record contains logtime which shows datetime to access page
# Next record logtime shows the datetime accessing new page and
# therefore leaving old page (close enough)
# 
#

import pandas as pd
import numpy as np
import datetime


# Creating fake data with string null and course1, course2
df = pd.DataFrame({
    'DAY' : pd.Timestamp('20130102'),
    'timespent' : abs(np.random.randn(5)),
    'event' : "course1",
    'somethingelse' : 'foo' })

df2 = pd.DataFrame({
    'DAY' : pd.Timestamp('20130102'),
    'timespent' : abs(np.random.randn(5)),
    'event' : "course2",
    'somethingelse' : 'foo' })

dfN = pd.DataFrame({
    'DAY' : pd.Timestamp('20130102'),
    'timespent' : abs(np.random.randn(1)),
    'event' : "null",
    'somethingelse' : 'foo' })


dfLog = [dfN, df, df2, dfN, dfN, dfN, df2, dfN, dfN, df, dfN, df2, dfN, df, df2, dfN]
activity = pd.concat(dfLog)
# add time column
times = pd.date_range('20130102', periods=activity.shape[0], freq='H')
activity['logtime'] = times

# activity contains a DAY field (probably not required)
# timespent -this is fake time spent on each event. This is
# not in my real data but I started this way when faking data
# event -either a course or null (not a course)
# somethingelse -just there to indicate other data. 
#

print(activity)  # This is quite close to real data.

# Fake activity date created above to demo question.

# *********************************************
# Actual code to extract time spent on courses
# *********************************************

# Helper functions to aggregate data from min and max.

# Where the time spent is an explicit value (timespent column).
def agg_timespent(a, b):
    c = abs(b - a)
    return c

# Where the time difference is not explicit but must be derived
# from the logtime recorded when accessing a page (course event).
def agg_logtime(a, b):
    # In real data b and a are strings:
    # b = datetime.datetime.strptime(b, '%Y-%m-%d %H:%M:%S')
    # a = datetime.datetime.strptime(a, '%Y-%m-%d %H:%M:%S')
    # Use total_seconds() rather than .seconds so that spans
    # longer than a day aren't truncated.
    c = abs(b - a).total_seconds()
    return c



# Remove 'null' data as that's not of interest here. 
# null means non course activity e.g. checking email
# or timetable -non course stuff.
activity = activity[activity.event != 'null']

print(activity)  # This shows *just* course activity info

# pivot by Day (only 1 day in fake data but 1 year in real data)
# Don't need DAY field but helped me fake-up data
flattened_v1 = activity.pivot_table(index=['DAY'], values=["timespent"],aggfunc=[min, max],fill_value=0)
flattened_v1['time_diff'] = flattened_v1.apply(lambda row: agg_timespent(row[0], row[1]), axis=1)


# How to achieve this?
# Where NULL has been removed I think this is wrong as NULL records could
# indicate several hours gap between course accesses but as
# I'm using MAX and MIN then I'm ignoring the periods of null
# This is overestimating time on courses
# I need to subtract/remove/ignore?? the hours spent on null times

flattened_v2 = activity.pivot_table(index=['DAY'], values=["logtime"],aggfunc=[min, max],fill_value=0)
flattened_v2['time_diff'] = flattened_v2.apply(lambda row: agg_logtime(row[0], row[1]), axis=1)

print()
print('*****Wrong!**********')
print('This is not what I have, just showing how I thought it might work.')
print(flattened_v1)
print()
print('******Not sure how to do this*********')
print('This is wrong as nulls/gaps are included too')
print(flattened_v2)

1 Answer:

Answer 0 (score: 1)

You're right (in your comment): you need dataframe.shift.

If I understand your question correctly, you want to record the time elapsed since the previous timestamp, so that each timestamp marks the start of an activity, and when the previous activity was null we shouldn't record any elapsed time. Assuming all that is right, use shift to add a column of time differences (note this works on the unfiltered activity frame, before the null rows are dropped):

activity['timelog_diff'] = activity['logtime'] - activity['logtime'].shift()

Now the first row will show the special "not a time" value NaT, but that's fine since we can't compute an elapsed time there. Next, we can fill in more NaT values wherever the elapsed time follows a null event:

mask = activity.event == 'null'
activity.loc[mask.shift(1).fillna(False), 'timelog_diff'] = pd.NaT

When we want to know how much time was spent on course1, we have to shift again, because indexing by the course1 rows yields the rows where course1 started; we need the rows where course1 finished:

activity[(activity.event == 'course1').shift().fillna(False)]['timelog_diff'].sum()

In your example this returns 15 hours for course1.
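Putting the answer's steps together, here is a self-contained sketch (fake hourly data, hypothetical column names matching the question) that also totals the time for every course at once by grouping on the shifted event column instead of filtering one course at a time:

```python
import pandas as pd

# Fake hourly log: each logtime marks when a page was opened,
# 'null' marks non-course activity.
log = pd.DataFrame({
    'event': ['null', 'course1', 'course1', 'null',
              'course2', 'course1', 'null'],
    'logtime': pd.date_range('2013-01-02', periods=7, freq='h'),
})

# Elapsed time since the previous page view (first row is NaT).
log['elapsed'] = log['logtime'].diff()

# Rows that follow a 'null' event measure non-course time: blank them.
log.loc[(log['event'] == 'null').shift(1, fill_value=False),
        'elapsed'] = pd.NaT

# Attribute each elapsed interval to the event that preceded it,
# then sum per event; NaT values are skipped by sum().
spent = log['elapsed'].groupby(log['event'].shift()).sum()
print(spent)
```

Here course1 accounts for three one-hour intervals and course2 for one, so `spent['course1']` is 3 hours and `spent['course2']` is 1 hour; intervals that started from a null event contribute nothing.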