Reducing execution time when locating elements in a pandas DataFrame

Time: 2017-01-11 13:02:49

Tags: python pandas

I asked another question here and have since found the bottleneck in my script, so I am asking my question more precisely. My code looks like this:

temp=df["IPs"]
times_db_all = [df[temp == user]["time"].values for user in user_db.values]

%timeit times_db_all = [df[temp == user]["time"].values for user in user_db.values[0:3]]
1 loops, best of 3: 848 ms per loop #848ms for 3 users !!

My df looks like this:

IPs        time
1.1.1.1    datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
1.1.1.1    datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
3.3.3.3    datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc()),
1.1.1.1    datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())
4.4.4.4    datetime.datetime(2017, 1, 10, 16, 23, 1, tzinfo=tzutc())
....

user_db.values = ["1.1.1.1","3.3.3.3","4.4.4.4",...]

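The goal is to get, for each user, a list of all the timestamps in the "time" column of df. I then use this list to check how long a user stayed on the website and how many times they visited: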

       IP       time
    1.1.1.1    [datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()), 
               datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),   
               datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())]
    3.3.3.3    [datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc())]
    4.4.4.4    [datetime.datetime(2017, 1, 10, 16, 23, 1, tzinfo=tzutc())]

My problem is that I have 3.5 million rows, and this one line slows execution down considerably.

What is a faster way to do the same thing?

2 Answers:

Answer 0: (score: 3)

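You should not use a for loop to make individual boolean selections the way you do. The isin method exists specifically for this purpose and selects the rows that match any of the values in user_db. Try: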

df.loc[df['IPs'].isin(user_db.values), "time"]
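Note that isin returns one filtered Series of timestamps rather than one list per user. Below is a minimal sketch of how the filter might be combined with a groupby to recover per-user lists. It assumes a small made-up frame with the question's column names; the sample rows and the `wanted` list are hypothetical stand-ins for df and user_db.values, not data from the original post.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    "IPs": ["1.1.1.1", "1.1.1.1", "3.3.3.3", "1.1.1.1", "4.4.4.4", "9.9.9.9"],
    "time": pd.to_datetime([
        "2017-01-03 00:00:03",
        "2017-01-04 01:07:30",
        "2017-01-04 05:58:52",
        "2017-01-10 16:22:56",
        "2017-01-10 16:23:01",
        "2017-01-11 08:00:00",
    ], utc=True),
})

# Hypothetical stand-in for user_db.values.
wanted = ["1.1.1.1", "3.3.3.3", "4.4.4.4"]

# One vectorized membership test instead of one boolean scan per user.
mask = df["IPs"].isin(wanted)
selected = df.loc[mask]

# Group the remaining rows once to get a list of timestamps per IP.
times_by_user = selected.groupby("IPs")["time"].apply(list)
```

The frame is scanned once for the membership test and once for the grouping, instead of once per user as in the list comprehension.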

Answer 1: (score: 1)

Try groupby, as shown below:

import datetime
from random import randint

import pandas as pd

# create a random dataframe with your data
def create_ip():
    return '.'.join([str(randint(0,255)) for i in range(4)])

def create_dt():
    return datetime.datetime(2017, 1, randint(1,10), randint(0,23), randint(0,59))

df = pd.DataFrame({'ip': [create_ip() for i in range(10)]*10,
                   'time': [create_dt() for i in range(100)]})

# use groupby
df.groupby('ip')['time'].apply(list)

[Out]
ip
127.140.64.48      [2017-01-10 04:23:00, 2017-01-03 16:55:00, 201...
150.206.39.49      [2017-01-02 03:07:00, 2017-01-07 21:59:00, 201...
186.188.130.77     [2017-01-04 13:03:00, 2017-01-05 19:23:00, 201...
190.152.20.150     [2017-01-02 12:47:00, 2017-01-03 23:55:00, 201...
208.235.194.243    [2017-01-10 08:55:00, 2017-01-08 08:07:00, 201...
223.138.217.41     [2017-01-02 22:36:00, 2017-01-10 02:16:00, 201...
226.176.251.244    [2017-01-03 12:08:00, 2017-01-07 06:14:00, 201...
24.21.19.130       [2017-01-07 14:05:00, 2017-01-05 04:25:00, 201...
50.167.31.84       [2017-01-10 03:28:00, 2017-01-03 11:05:00, 201...
83.56.204.14       [2017-01-08 12:46:00, 2017-01-01 03:05:00, 201...
Name: time, dtype: object

# compare times
%timeit df.groupby('ip')['time'].apply(list)
[Out] 100 loops, best of 3: 2.69 ms per loop

%timeit times_db_all = [df[df['ip'] == user]['time'].values for user in df['ip'].unique()]
[Out] 100 loops, best of 3: 10.6 ms per loop

You can achieve this faster by setting 'ip' as the index and then grouping on the index.
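The code for this index-based variant is not present in the answer. A minimal sketch of what it might look like, assuming a frame with 'ip' and 'time' columns like the one in this answer (the four sample rows here are made up so the block runs on its own). Whether it is actually faster depends on the data and the pandas version, so it is worth timing on the real frame.

```python
import pandas as pd

# Hypothetical sample frame with the answer's column names.
df = pd.DataFrame({
    "ip": ["1.1.1.1", "3.3.3.3", "1.1.1.1", "4.4.4.4"],
    "time": pd.to_datetime([
        "2017-01-03 00:00:03",
        "2017-01-04 05:58:52",
        "2017-01-04 01:07:30",
        "2017-01-10 16:23:01",
    ]),
})

# Move 'ip' into the index, then group on index level 0.
indexed = df.set_index("ip")
by_index = indexed.groupby(level=0)["time"].apply(list)

# Column-based grouping for comparison; both give the same lists.
by_column = df.groupby("ip")["time"].apply(list)
```

If the frame is reused many times, the set_index call can be done once up front so that only the grouping step is repeated.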