Need for speed: slow nested groupbys and applys in Pandas

Time: 2014-02-08 00:10:05

Tags: python pandas

I'm doing a complicated transformation of a DataFrame. I thought Pandas would be fast at this, but the only way I've managed to express it is with nested groupbys and applys using lambda functions, and it is slow. It feels like there should be a built-in, faster way. At n_rows = 1000 it takes 2 seconds, but I will be running this on 10^7 rows, so that is far too slow. The operation is hard to explain in words, so here are the code and the profile first, followed by an explanation:

import pandas as pd
from numpy import array, arange
from numpy.random import randint

n_rows = 1000

d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping

f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame

q = d.groupby(grps).apply(h) #Slow



824984 function calls (816675 primitive calls) in 1.850 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
221770    0.105    0.000    0.105    0.000 {isinstance}
  7329    0.104    0.000    0.217    0.000 index.py:86(__new__)
  8309    0.089    0.000    0.423    0.000 series.py:430(__new__)
  5375    0.081    0.000    0.081    0.000 {method 'reduce' of 'numpy.ufunc' objects}
 34225    0.068    0.000    0.133    0.000 {method 'view' of 'numpy.ndarray' objects}
36780/36779    0.067    0.000    0.067    0.000 {numpy.core.multiarray.array}
  5349    0.065    0.000    0.567    0.000 series.py:709(_get_values)
 985/1    0.063    0.000    1.847    1.847 groupby.py:608(apply)
  5349    0.056    0.000    0.198    0.000 _methods.py:42(_mean)
  5358    0.050    0.000    0.232    0.000 index.py:332(__getitem__)
  8309    0.049    0.000    0.228    0.000 series.py:3299(_sanitize_array)
  9296    0.047    0.000    0.116    0.000 index.py:1341(__new__)
   984    0.039    0.000    0.092    0.000 algorithms.py:105(factorize)

Group the DataFrame rows by grps. Within each row group, for each row, group the row's entries by their values (i.e., all the cells holding a 3 together, all the cells holding a 4 together, and so on). For each column index in a value group, look up the corresponding entry in dgs and take the mean. Then take the mean of those results across the rows of the row group.
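To make the per-row step concrete, here is a minimal sketch of what g (and f inside it) computes for a single hand-picked row; the row values here are illustrative, not taken from the data above:

import pandas as pd
from numpy import array

dgs = array([3,4,1,8,9,2,3,7,10,8]) #Same lookup table as above
row = pd.Series([3, 5, 3, 5]) #Hypothetical row: 3s in columns 0 and 2, 5s in 1 and 3

means = row.groupby(row).apply(lambda x: dgs[x.index].mean()) #This is g
#means[3] == dgs[[0, 2]].mean() == 2.0
#means[5] == dgs[[1, 3]].mean() == 6.0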

::exhale::

Any suggestions on how to rearrange this for speed would be appreciated.

1 Answer:

Answer 0 (score: 5)

You can replace the nested apply and groupby with a single multi-level groupby; here is the code:

import numpy as np
import pandas as pd
from numpy import array, arange
from numpy.random import randint, seed

seed(42)
n_rows = 1000

d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping

f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame

print d.groupby(grps).apply(h) #Slow

### my code starts from here ###

def group_process(df2):
    #Flatten the group's rows into one long Series whose MultiIndex keeps
    #(row label, column label) for every cell
    s = df2.stack()
    #Tile the dgs lookup values so each stacked cell is paired with the dgs
    #entry for its column position
    v = np.repeat(dgs[None, :df2.shape[1]], df2.shape[0], axis=0).ravel()
    #One multi-level groupby replaces the nested apply: group the dgs values
    #by (row, cell value), average within each group, then average across
    #rows for each cell value
    return pd.Series(v).groupby([s.index.get_level_values(0), s.values]).mean().mean(level=1)

print d.groupby(grps).apply(group_process)

Output (the two identical frames show that the fast version matches the original):

               1         2         3         4         5         6         7  \
(1, 2]  4.621575  4.625887  4.775235  4.954321  4.566441  4.568111  4.835664   
(2, 3]  4.446347  4.138528  4.862613  4.800538  4.582721  4.595890  4.794183   
(3, 4]  4.776144  4.510119  4.391729  4.392262  4.930556  4.695776  4.630068   

               8         9  
(1, 2]  4.246085  4.520384  
(2, 3]  5.237360  4.418934  
(3, 4]  4.829167  4.681548  

[3 rows x 9 columns]
               1         2         3         4         5         6         7  \
(1, 2]  4.621575  4.625887  4.775235  4.954321  4.566441  4.568111  4.835664   
(2, 3]  4.446347  4.138528  4.862613  4.800538  4.582721  4.595890  4.794183   
(3, 4]  4.776144  4.510119  4.391729  4.392262  4.930556  4.695776  4.630068   

               8         9  
(1, 2]  4.246085  4.520384  
(2, 3]  5.237360  4.418934  
(3, 4]  4.829167  4.681548  

[3 rows x 9 columns]

It's about 70 times faster, but I don't know whether it will cope with 10**7 rows.
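Whether it holds up at scale is easy to probe directly. Here is a minimal timing sketch reusing the setup and group_process from above; the sample sizes in the loop are arbitrary choices, and the absolute numbers will depend on your machine and pandas version:

import time

for n_rows in (10**3, 10**4, 10**5): #Sample sizes chosen arbitrarily
    d = pd.DataFrame(randint(1, 10, (n_rows, 8)))
    grps = pd.cut(randint(1, 5, n_rows), arange(1, 5))
    t0 = time.time()
    d.groupby(grps).apply(group_process)
    print n_rows, time.time() - t0 #Watch how the runtime grows with n_rows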