I asked another question here and have since found the bottleneck in my script, so I am asking again more precisely. My code looks like this:
temp=df["IPs"]
times_db_all = [df[temp == user]["time"].values for user in user_db.values]
%timeit times_db_all = [df[temp == user]["time"].values for user in user_db.values[0:3]]
1 loops, best of 3: 848 ms per loop #848ms for 3 users !!
My df looks like this:
IPs time
1.1.1.1 datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
1.1.1.1 datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
3.3.3.3 datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc()),
1.1.1.1 datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())
4.4.4.4 datetime.datetime(2017, 1, 10, 16, 23, 01, tzinfo=tzutc())
....
user_db.values = ["1.1.1.1","3.3.3.3","4.4.4.4",...]
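The goal is to get, for each user, a list of all the timestamps in the "time" column of df. I then use this list to check how long the user stayed on the website and how many times they visited: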
IP time
1.1.1.1 [datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())]
3.3.3.3 [datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc())]
4.4.4.4 [datetime.datetime(2017, 1, 10, 16, 23, 01, tzinfo=tzutc())]
My problem is that I have 3.5 million rows, and this one line slows execution down dramatically.
What is a faster way to do the same thing?
Answer 0 (score: 3)
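You shouldn't use a for loop to do a separate boolean selection for each user. The isin method exists for exactly this purpose and selects the rows that match any of the values in user_db. Try: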
df.loc[df['IPs'].isin(user_db.values), "time"]
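Note that isin returns a single flat selection of "time" values rather than one list per user. A minimal sketch of how the filter might be combined with a groupby to get the per-user lists the question asks for; the sample rows are taken from the question, and the user_db Series here is a hypothetical stand-in for the asker's object:

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'IPs': ['1.1.1.1', '1.1.1.1', '3.3.3.3', '1.1.1.1', '4.4.4.4'],
    'time': pd.to_datetime([
        '2017-01-03 00:00:03', '2017-01-04 01:07:30', '2017-01-04 05:58:52',
        '2017-01-10 16:22:56', '2017-01-10 16:23:01',
    ], utc=True),
})

# Hypothetical stand-in for user_db: only two of the three IPs are of interest.
user_db = pd.Series(['1.1.1.1', '4.4.4.4'])

# One vectorised pass: keep the "time" values whose IP appears in user_db.
selected_times = df.loc[df['IPs'].isin(user_db.values), 'time']

# To get one list per user, group the filtered rows by IP.
subset = df[df['IPs'].isin(user_db.values)]
times_per_user = subset.groupby('IPs')['time'].apply(list)
```

If every IP in df is already in user_db, the isin filter is redundant and the groupby alone should be enough.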
Answer 1 (score: 1)
Try groupby, as shown below:
# create a random dataframe with your data
import datetime
from random import randint
import pandas as pd
def create_ip(): return '.'.join([str(randint(0,255)) for i in range(4)])
def create_dt(): return datetime.datetime(2017, 1, randint(1,10), randint(0,23), randint(0,59))
df = pd.DataFrame({'ip': [create_ip() for i in range(10)]*10,
'time': [create_dt() for i in range(100)]})
# use groupby
df.groupby('ip')['time'].apply(list)
[Out]
ip
127.140.64.48 [2017-01-10 04:23:00, 2017-01-03 16:55:00, 201...
150.206.39.49 [2017-01-02 03:07:00, 2017-01-07 21:59:00, 201...
186.188.130.77 [2017-01-04 13:03:00, 2017-01-05 19:23:00, 201...
190.152.20.150 [2017-01-02 12:47:00, 2017-01-03 23:55:00, 201...
208.235.194.243 [2017-01-10 08:55:00, 2017-01-08 08:07:00, 201...
223.138.217.41 [2017-01-02 22:36:00, 2017-01-10 02:16:00, 201...
226.176.251.244 [2017-01-03 12:08:00, 2017-01-07 06:14:00, 201...
24.21.19.130 [2017-01-07 14:05:00, 2017-01-05 04:25:00, 201...
50.167.31.84 [2017-01-10 03:28:00, 2017-01-03 11:05:00, 201...
83.56.204.14 [2017-01-08 12:46:00, 2017-01-01 03:05:00, 201...
Name: time, dtype: object
# compare times
%timeit df.groupby('ip')['time'].apply(list)
[Out] 100 loops, best of 3: 2.69 ms per loop
%timeit times_db_all = [df[df['ip'] == user]['time'].values for user in df['ip'].unique()]
[Out] 100 loops, best of 3: 10.6 ms per loop
You can achieve this even faster by setting 'ip' as the index and then grouping on the index.
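A minimal sketch of what grouping on the index could look like, assuming the 'ip' and 'time' column names used in this answer and a few sample rows from the question. Whether it is actually faster probably depends on the pandas version and the data, so it is worth timing both variants with %timeit on the real DataFrame:

```python
import pandas as pd

# Sample rows from the question, using this answer's column names.
df = pd.DataFrame({
    'ip': ['1.1.1.1', '1.1.1.1', '3.3.3.3', '1.1.1.1', '4.4.4.4'],
    'time': pd.to_datetime([
        '2017-01-03 00:00:03', '2017-01-04 01:07:30', '2017-01-04 05:58:52',
        '2017-01-10 16:22:56', '2017-01-10 16:23:01',
    ], utc=True),
})

# Move 'ip' into the index, then group on index level 0 instead of on a column.
times_by_ip = df.set_index('ip').groupby(level=0)['time'].apply(list)

# Grouping on the column directly gives the same lists, for comparison.
times_by_column = df.groupby('ip')['time'].apply(list)
```

Both results are Series indexed by IP, so times_by_ip['1.1.1.1'] returns the list of timestamps for that user.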