I have a large dataframe containing website user logs, and I need to find the duration of each visit for each user.
I have 3.5 million rows and 450k unique users.
Here is my code:
import dateutil.parser
import numpy as np

temp = df["server.REMOTE_ADDR"]      # IP column of the main df (which holds timestamps and IP addresses)
user_db = df["server.REMOTE_ADDR"]   # all IP addresses
user_db = user_db.drop_duplicates()  # drop duplicate IPs
time_thresh = 15 * 60  # if a user is inactive for 15 minutes, it's a new visit
temp_moyen = []   # list of mean visit durations
temp_min = []     # list of minimal visit durations
temp_max = []     # list of maximal visit durations
nb_visites = []   # list of visit counts

for k, user in enumerate(user_db.values):  # for each user
    print("User {}/{}".format(k + 1, len(user_db.values)))
    t0 = []  # start times of visits
    tf = []  # end times of visits
    times_db = df[temp == user]["server.date"].values  # retrieve all timestamps for the current user
    times_db = [dateutil.parser.parse(times) for times in times_db]  # parse to datetime
    i = 1
    last_t = times_db[0]
    delta = 0
    while i < len(times_db):  # while there are timestamps left in the list
        t0.append(times_db[i - 1])  # begin a new visit
        delta = 0
        while delta < time_thresh and i < len(times_db):  # while not inactive for 15 minutes
            delta = (times_db[i] - last_t).total_seconds()
            last_t = times_db[i]
            i += 1
        if i != len(times_db):  # if not the last run
            tf.append(times_db[i - 2])
        else:  # no more timestamps: record the last one as the end of the last visit
            tf.append(times_db[-1])
    if len(times_db) <= 1:  # if there is only one timestamp, the visit begins and ends with it (tf = t0)
        t0.append(times_db[0])
        tf.append(times_db[-1])
    diff = [(final - first).total_seconds() for first, final in zip(t0, tf)]  # duration of each visit
    temp_moyen.append(np.mean(diff))  # append the statistics to the lists
    temp_min.append(np.min(diff))
    temp_max.append(np.max(diff))
    nb_visites.append(len(diff))

user_db = user_db.to_frame()        # convert to dataframe
user_db["temp_moyen"] = temp_moyen  # add a column for each statistic (mean, min, max, number of visits)
user_db["temp_min"] = temp_min
user_db["temp_max"] = temp_max
user_db["nb_visites"] = nb_visites
This code works, but it is very slow: it processes about 200 users per minute on my machine. What can I do to:
identify the bottleneck?
speed it up?
EDIT:
As requested, here is what my data looks like:
For each user, I have a list of timestamps: [100, 101, 104, 106, 109, 200, 209, 211, 213]
I need to find the number of visits for a single user; in this case, the list represents two visits, 100-109 and 200-213. The first visit lasted 9 time units and the second lasted 13, so I can compute the mean, minimum, and maximum visit durations.
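A toy sketch of what I mean (illustrative only: it assumes plain integer timestamps and uses 15 as the inactivity threshold, not my real 15-minute one):

timestamps = [100, 101, 104, 106, 109, 200, 209, 211, 213]
thresh = 15  # a gap larger than this starts a new visit

visits = [[timestamps[0]]]
for prev, cur in zip(timestamps, timestamps[1:]):
    if cur - prev > thresh:
        visits.append([cur])    # large gap: a new visit starts
    else:
        visits[-1].append(cur)  # small gap: the same visit continues

durations = [v[-1] - v[0] for v in visits]
print(visits)     # [[100, 101, 104, 106, 109], [200, 209, 211, 213]]
print(durations)  # [9, 13]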
EDIT 2: the bottleneck is here (277 ms of the 300 ms per loop iteration):
times_db = df[temp == user]["server.date"].values # retrieve all timestamps for current user
I moved it into a list comprehension before the for loop, but it is still slow:
times_db_all = [df[temp == user]["server.date"].values for user in user_db.values]
%timeit times_db_all = [df_temp[temp == user]["server.date"].values for user in user_db.values[0:3]]
1 loops, best of 3: 848 ms per loop #848ms for 3 users !!
My dataframe looks like this:
user_ip | server.date
1.1.1.1   datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
1.1.1.1   datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
3.3.3.3   datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc()),
1.1.1.1   datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc()),
4.4.4.4   datetime.datetime(2017, 1, 10, 16, 23, 1, tzinfo=tzutc())
....
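For reference, a toy frame with the same shape can be built like this (illustrative values only; I use the server.REMOTE_ADDR column name from my code, which the listing above labels user_ip):

import datetime
import pandas as pd
from dateutil.tz import tzutc

df = pd.DataFrame({
    "server.REMOTE_ADDR": ["1.1.1.1", "1.1.1.1", "3.3.3.3", "1.1.1.1", "4.4.4.4"],
    "server.date": [
        datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
        datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
        datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc()),
        datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc()),
        datetime.datetime(2017, 1, 10, 16, 23, 1, tzinfo=tzutc()),
    ],
})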
Answer 0 (score: 2)
Following up on my comment about removing the loop: as I understand it, you have a series of activity timestamps, and you assume that as long as consecutive timestamps are close together they belong to a single visit, and otherwise they mark different visits. For example, [100, 101, 104, 106, 109, 200, 209, 211, 213] represents two visits, 100-109 and 200-213. To speed this up, you can vectorize the computation with numpy:
import numpy as np

cutoff = 15
times = np.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])
delta = times[1:] - times[:-1]          # gaps between consecutive timestamps
which = delta > cutoff                  # identifies which gaps represent a new visit
N_visits = which.sum() + 1              # note the +1 for the 'fence post'
L_boundaries = np.zeros((N_visits,))    # generating these arrays might be unnecessary and relatively slow
R_boundaries = np.zeros((N_visits,))
L_boundaries[0] = times[0]              # the first visit starts at the first timestamp
L_boundaries[1:] = times[1:][which]     # later visits start right after a long gap
R_boundaries[:-1] = times[:-1][which]   # earlier visits end right before a long gap
R_boundaries[-1] = times[-1]            # the last visit ends at the last timestamp
visit_lengths = R_boundaries - L_boundaries
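From visit_lengths you can then read off the per-user statistics the question asks for, for example:

mean_t = visit_lengths.mean()   # mean visit duration
min_t = visit_lengths.min()     # shortest visit
max_t = visit_lengths.max()     # longest visit
n_visits = len(visit_lengths)   # number of visits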
This could probably be made faster, but it is likely already much faster than your current loop.
The following may be a little faster still, at the cost of some code clarity:
import numpy as np

cutoff = 15
times = np.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])
which = times[1:] - times[:-1] > cutoff
N_visits = which.sum() + 1  # fence post
visit_lengths = np.zeros((N_visits,))  # it is probably inevitable to have to generate this new array
visit_lengths[0] = times[:-1][which][0] - times[0]                   # first visit
visit_lengths[1:-1] = times[:-1][which][1:] - times[1:][which][:-1]  # middle visits
visit_lengths[-1] = times[-1] - times[1:][which][-1]                 # last visit
I also think that, if you do not care much about the first and last visits, it might be worth considering ignoring them altogether.
EDIT based on the OP's EDIT
You should probably have a look at http://pandas.pydata.org/pandas-docs/stable/indexing.html. I think the slowest part is that you copy part of the dataframe for each user: df[temp == user] creates a new dataframe and stores it as times_db. It might be faster to pull the resulting values into a numpy array instead. You could also parse the whole column to datetime up front, on the entire dataframe.
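As a rough sketch of both suggestions (assuming the server.REMOTE_ADDR and server.date column names from the question, and pandas imported as pd), a single groupby pass avoids rescanning the whole frame once per user, and to_datetime parses the entire column at once:

import pandas as pd

df["server.date"] = pd.to_datetime(df["server.date"])  # parse the whole column once, up front

# one pass over the frame instead of one full scan per user
times_by_user = {
    ip: grp["server.date"].values
    for ip, grp in df.groupby("server.REMOTE_ADDR")
}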
Answer 1 (score: 1)
I cannot see your sample data, so here are my suggestions:
Before you try to optimize your code, use a profiler to gather statistics about it:
import cProfile
cProfile.run('foo()')
or python -m cProfile foo.py
You get statistics describing how often and for how long the various parts of your program executed. This is a necessary prerequisite for optimization.
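For example, to save the statistics to a file and print the ten most expensive calls sorted by cumulative time (foo() stands in for your own entry point):

import cProfile
import pstats

cProfile.run("foo()", "profile_stats")          # write the statistics to a file
stats = pstats.Stats("profile_stats")
stats.sort_stats("cumulative").print_stats(10)  # ten most expensive calls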
If your data consists of multi-dimensional arrays and matrices, try pandas or numpy; they can speed up your code.
Sometimes the reason a program is slow is too much disk I/O or too many database accesses. Make sure that is not the case in your code.
Try to eliminate common subexpressions in tight loops, for instance as sketched below.
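In your inner loop, for example, len(times_db) is re-evaluated on every pass; hoisting it out is the kind of thing I mean (the gain is small here, but it illustrates the idea):

n = len(times_db)  # computed once instead of on every loop test
while delta < time_thresh and i < n:
    delta = (times_db[i] - last_t).total_seconds()
    last_t = times_db[i]
    i += 1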
Hope this helps.