I have a large dataframe containing website user logs, and I need to find the duration of each visit for each user.
I have 3.5 million rows and 450k unique users.
Here is my code:
import dateutil.parser
import numpy as np

temp = df["server.REMOTE_ADDR"]      # IP column of the main df (which holds timestamps and IP addresses)
user_db = df["server.REMOTE_ADDR"]   # all IP addresses
user_db = user_db.drop_duplicates()  # drop duplicate IPs
time_thresh = 15 * 60  # if a user is inactive for 15 minutes, it's a new visit
temp_moyen = []   # list of mean visit durations
temp_min = []     # list of minimal visit durations
temp_max = []     # list of maximal visit durations
nb_visites = []   # list of visit counts

for k, user in enumerate(user_db.values):  # for each user
    print("User {}/{}".format(k + 1, len(user_db.values)))
    t0 = []  # start times of visits
    tf = []  # end times of visits
    times_db = df[temp == user]["server.date"].values  # retrieve all timestamps for the current user
    times_db = [dateutil.parser.parse(times) for times in times_db]  # parse to datetime
    i = 1
    last_t = times_db[0]
    delta = 0
    while i < len(times_db):  # while there are timestamps left in the list
        t0.append(times_db[i - 1])  # begin a new visit
        delta = 0
        while delta < time_thresh and i < len(times_db):  # while not inactive for 15 minutes
            delta = (times_db[i] - last_t).total_seconds()
            last_t = times_db[i]
            i += 1
        if i != len(times_db):  # if not the last run
            tf.append(times_db[i - 2])
        else:  # no more timestamps: record the last one as the end of the last visit
            tf.append(times_db[-1])
    if len(times_db) <= 1:  # if there is only one timestamp, the visit begins and ends with it (tf = t0)
        t0.append(times_db[0])
        tf.append(times_db[-1])
    diff = [(final - first).total_seconds() for first, final in zip(t0, tf)]  # duration of each visit
    temp_moyen.append(np.mean(diff))  # append the statistics to the lists
    temp_min.append(np.min(diff))
    temp_max.append(np.max(diff))
    nb_visites.append(len(diff))

user_db = user_db.to_frame()        # convert to dataframe
user_db["temp_moyen"] = temp_moyen  # add a column for each statistic (mean, min, max, number of visits)
user_db["temp_min"] = temp_min
user_db["temp_max"] = temp_max
user_db["nb_visites"] = nb_visites
This code works, but it is very slow: it processes about 200 users per minute on my machine. What can I do to:
identify the bottleneck?
speed it up?
EDIT:
As requested, here is what my data looks like:
For each user, I have a list of timestamps: [100, 101, 104, 106, 109, 200, 209, 211, 213]
I need to find the number of visits for a single user; in this case, the list represents two visits, 100-109 and 200-213. The first visit lasted 9 time units and the second lasted 13, so I can compute the mean, minimum, and maximum visit durations.
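A toy sketch of what I mean (illustrative only: it assumes plain integer timestamps and uses 15 as the inactivity threshold, not my real 15-minute one):

timestamps = [100, 101, 104, 106, 109, 200, 209, 211, 213]
thresh = 15  # a gap larger than this starts a new visit

visits = [[timestamps[0]]]
for prev, cur in zip(timestamps, timestamps[1:]):
    if cur - prev > thresh:
        visits.append([cur])    # large gap: a new visit starts
    else:
        visits[-1].append(cur)  # small gap: the same visit continues

durations = [v[-1] - v[0] for v in visits]
print(visits)     # [[100, 101, 104, 106, 109], [200, 209, 211, 213]]
print(durations)  # [9, 13]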
EDIT 2: the bottleneck is here (277 ms of the 300 ms per loop iteration):
times_db = df[temp == user]["server.date"].values # retrieve all timestamps for current user
I moved it into a list comprehension before the for loop, but it is still slow:
times_db_all = [df[temp == user]["server.date"].values for user in user_db.values]
%timeit times_db_all = [df_temp[temp == user]["server.date"].values for user in user_db.values[0:3]]
1 loops, best of 3: 848 ms per loop #848ms for 3 users !!
My dataframe looks like this:
user_ip | server.date
1.1.1.1   datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
1.1.1.1   datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
3.3.3.3   datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc()),
1.1.1.1   datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc()),
4.4.4.4   datetime.datetime(2017, 1, 10, 16, 23, 1, tzinfo=tzutc())
....
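For reference, a toy frame with the same shape can be built like this (illustrative values only; I use the server.REMOTE_ADDR column name from my code, which the listing above labels user_ip):

import datetime
import pandas as pd
from dateutil.tz import tzutc

df = pd.DataFrame({
    "server.REMOTE_ADDR": ["1.1.1.1", "1.1.1.1", "3.3.3.3", "1.1.1.1", "4.4.4.4"],
    "server.date": [
        datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
        datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
        datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc()),
        datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc()),
        datetime.datetime(2017, 1, 10, 16, 23, 1, tzinfo=tzutc()),
    ],
})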
Answer 0 (score: 2)
Following up on my comment about removing the loop: as I understand it, you have a series of activity timestamps, and you assume that as long as consecutive timestamps are close together they belong to a single visit, and otherwise they mark different visits. For example, [100, 101, 104, 106, 109, 200, 209, 211, 213] represents two visits, 100-109 and 200-213. To speed this up, you can vectorize the computation with numpy:
import numpy as np

cutoff = 15
times = np.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])
delta = times[1:] - times[:-1]          # gaps between consecutive timestamps
which = delta > cutoff                  # identifies which gaps represent a new visit
N_visits = which.sum() + 1              # note the +1 for the 'fence post'
L_boundaries = np.zeros((N_visits,))    # generating these arrays might be unnecessary and relatively slow
R_boundaries = np.zeros((N_visits,))
L_boundaries[0] = times[0]              # the first visit starts at the first timestamp
L_boundaries[1:] = times[1:][which]     # later visits start right after a long gap
R_boundaries[:-1] = times[:-1][which]   # earlier visits end right before a long gap
R_boundaries[-1] = times[-1]            # the last visit ends at the last timestamp
visit_lengths = R_boundaries - L_boundaries
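From visit_lengths you can then read off the per-user statistics the question asks for, for example:

mean_t = visit_lengths.mean()   # mean visit duration
min_t = visit_lengths.min()     # shortest visit
max_t = visit_lengths.max()     # longest visit
n_visits = len(visit_lengths)   # number of visits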
This could probably be made faster, but it is likely already much faster than your current loop.
The following may be a little faster still, at the cost of some code clarity:
import numpy as np

cutoff = 15
times = np.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])
which = times[1:] - times[:-1] > cutoff
N_visits = which.sum() + 1  # fence post
visit_lengths = np.zeros((N_visits,))  # it is probably inevitable to have to generate this new array
visit_lengths[0] = times[:-1][which][0] - times[0]                   # first visit
visit_lengths[1:-1] = times[:-1][which][1:] - times[1:][which][:-1]  # middle visits
visit_lengths[-1] = times[-1] - times[1:][which][-1]                 # last visit
I also think that, if you do not care much about the first and last visits, it might be worth considering ignoring them altogether.
EDIT based on the OP's EDIT
You should probably have a look at http://pandas.pydata.org/pandas-docs/stable/indexing.html. I think the slowest part is that you copy part of the dataframe for each user: df[temp == user] creates a new dataframe and stores it as times_db. It might be faster to pull the resulting values into a numpy array instead. You could also parse the whole column to datetime up front, on the entire dataframe.
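As a rough sketch of both suggestions (assuming the server.REMOTE_ADDR and server.date column names from the question, and pandas imported as pd), a single groupby pass avoids rescanning the whole frame once per user, and to_datetime parses the entire column at once:

import pandas as pd

df["server.date"] = pd.to_datetime(df["server.date"])  # parse the whole column once, up front

# one pass over the frame instead of one full scan per user
times_by_user = {
    ip: grp["server.date"].values
    for ip, grp in df.groupby("server.REMOTE_ADDR")
}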
Answer 1 (score: 1)
I cannot see your sample data, so here are my suggestions:
Before you try to optimize your code, use a profiler to gather statistics about it:
import cProfile
cProfile.run('foo()')
or python -m cProfile foo.py
You get statistics describing how often and for how long the various parts of your program executed. This is a necessary prerequisite for optimization.
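For example, to save the statistics to a file and print the ten most expensive calls sorted by cumulative time (foo() stands in for your own entry point):

import cProfile
import pstats

cProfile.run("foo()", "profile_stats")          # write the statistics to a file
stats = pstats.Stats("profile_stats")
stats.sort_stats("cumulative").print_stats(10)  # ten most expensive calls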
If your data consists of multi-dimensional arrays and matrices, try pandas or numpy; they can speed up your code.
Sometimes the reason a program is slow is too much disk I/O or too many database accesses. Make sure that is not the case in your code.
Try to eliminate common subexpressions in tight loops, for instance as sketched below.
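In your inner loop, for example, len(times_db) is re-evaluated on every pass; hoisting it out is the kind of thing I mean (the gain is small here, but it illustrates the idea):

n = len(times_db)  # computed once instead of on every loop test
while delta < time_thresh and i < n:
    delta = (times_db[i] - last_t).total_seconds()
    last_t = times_db[i]
    i += 1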
Hope this helps.