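I'm performing a complex transformation on a DataFrame. I expected pandas to be fast, but the only way I've managed to do it is with nested groupbys and applys using lambda functions, and it's slow. It seems like there should be a built-in, faster way. At n_rows = 1000 it takes about 2 seconds, but I'll be running it on 10^7 rows, so that's far too slow. It's hard to explain what we're doing, so here are the code and a profile, and then I'll explain: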
import pandas as pd
from numpy import array, arange
from numpy.random import randint

n_rows = 1000
d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping
f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame
q = d.groupby(grps).apply(h) #Slow
824984 function calls (816675 primitive calls) in 1.850 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
221770 0.105 0.000 0.105 0.000 {isinstance}
7329 0.104 0.000 0.217 0.000 index.py:86(__new__)
8309 0.089 0.000 0.423 0.000 series.py:430(__new__)
5375 0.081 0.000 0.081 0.000 {method 'reduce' of 'numpy.ufunc' objects}
34225 0.068 0.000 0.133 0.000 {method 'view' of 'numpy.ndarray' objects}
36780/36779 0.067 0.000 0.067 0.000 {numpy.core.multiarray.array}
5349 0.065 0.000 0.567 0.000 series.py:709(_get_values)
985/1 0.063 0.000 1.847 1.847 groupby.py:608(apply)
5349 0.056 0.000 0.198 0.000 _methods.py:42(_mean)
5358 0.050 0.000 0.232 0.000 index.py:332(__getitem__)
8309 0.049 0.000 0.228 0.000 series.py:3299(_sanitize_array)
9296 0.047 0.000 0.116 0.000 index.py:1341(__new__)
984 0.039 0.000 0.092 0.000 algorithms.py:105(factorize)
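The listing above has the shape of cProfile output sorted by internal time. The post does not say how it was collected, so the following is only a minimal sketch of one way to get a comparable report. `work` is a hypothetical stand-in for the slow call (here it would be `d.groupby(grps).apply(h)`).

```python
import cProfile
import io
import pstats


def work():
    # Hypothetical placeholder for the code being measured.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by time spent inside each function itself ("internal time")
# and keep only the ten most expensive entries.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(10)
report = buf.getvalue()
print(report)
```

Sorting by `tottime` puts the functions that burn the most time themselves at the top, which is why small pandas internals such as the `Series` and `Index` constructors show up first in the listing.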
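Group the DataFrame rows by grps. For each group, for each row, group the entries that share the same value (i.e. all entries with value 3 together, separately from all entries with value 4). For each index in a value group, look up the corresponding index in dgs and average the results. Then take the mean over the rows of the row group.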
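The steps described above can be restated with explicit loops. This is a hedged sketch for checking results on tiny inputs, not the code from the post: the function name `reference`, the plain-list `labels` argument, and the 2x4 example data are all made up for illustration.

```python
import numpy as np


def reference(rows, dgs, labels):
    """Slow, plain-Python restatement of the transformation.

    rows   : 2-D integer array, one data row per line
    dgs    : 1-D lookup array, indexed by column position
    labels : one group label per row (None means the row is in no group)
    Returns {group label: {value: mean over rows of the per-row means}}.
    """
    per_group = {}
    for row, label in zip(rows, labels):
        if label is None:
            continue
        # Within this row, collect the column positions holding each value.
        cols_by_value = {}
        for col, value in enumerate(row):
            cols_by_value.setdefault(int(value), []).append(col)
        # Look those positions up in dgs and average them, per value.
        for value, cols in cols_by_value.items():
            row_mean = float(np.mean(dgs[cols]))
            per_group.setdefault(label, {}).setdefault(value, []).append(row_mean)
    # Finally, average the per-row means across the rows of each group.
    return {
        label: {value: float(np.mean(means)) for value, means in by_value.items()}
        for label, by_value in per_group.items()
    }


dgs = np.array([3, 4, 1, 8])
rows = np.array([[3, 3, 4, 4],
                 [3, 4, 4, 4]])
result = reference(rows, dgs, ["a", "a"])
print(result)
```

In the first row, value 3 sits in columns 0 and 1, so its per-row mean is (3 + 4) / 2 = 3.5. In the second row it sits only in column 0, giving 3. The group mean for value 3 is therefore 3.25.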
Any suggestions on how to rearrange this for speed would be appreciated.
Answer 0 (score: 5)
You can replace the nested apply and groupby with a single multi-level groupby. Here is the code:
import numpy as np
import pandas as pd
from numpy import array, arange
from numpy.random import randint, seed
seed(42)
n_rows = 1000
d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping
f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame
print(d.groupby(grps).apply(h)) #Slow
### my code starts from here ###
def group_process(df2):
    s = df2.stack() #Long Series indexed by (row label, column label)
    #dgs entry for each cell's column, in the same row-major order as s
    v = np.repeat(dgs[None, :df2.shape[1]], df2.shape[0], axis=0).ravel()
    #Mean per (row, value), then mean over the rows for each value
    return pd.Series(v).groupby([s.index.get_level_values(0), s.values]).mean().groupby(level=1).mean()
print(d.groupby(grps).apply(group_process))
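Output (the first table comes from the original nested version, the second from group_process; they match):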
               1         2         3         4         5         6         7  \
(1, 2]  4.621575  4.625887  4.775235  4.954321  4.566441  4.568111  4.835664
(2, 3]  4.446347  4.138528  4.862613  4.800538  4.582721  4.595890  4.794183
(3, 4]  4.776144  4.510119  4.391729  4.392262  4.930556  4.695776  4.630068

               8         9
(1, 2]  4.246085  4.520384
(2, 3]  5.237360  4.418934
(3, 4]  4.829167  4.681548

[3 rows x 9 columns]
               1         2         3         4         5         6         7  \
(1, 2]  4.621575  4.625887  4.775235  4.954321  4.566441  4.568111  4.835664
(2, 3]  4.446347  4.138528  4.862613  4.800538  4.582721  4.595890  4.794183
(3, 4]  4.776144  4.510119  4.391729  4.392262  4.930556  4.695776  4.630068

               8         9
(1, 2]  4.246085  4.520384
(2, 3]  5.237360  4.418934
(3, 4]  4.829167  4.681548

[3 rows x 9 columns]
It's about 70 times faster, but I don't know whether it will work for 10**7 rows.
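For larger inputs, one possible variation is to drop `groupby(...).apply` entirely and use two flat groupbys over a long-format table. This is a different technique from the answer's code and has not been benchmarked here, so treat it as a sketch. The function name `single_pass` and the column names `row`, `value`, `dg`, and `grp` are made up for illustration, and it assumes `grps` is a pandas Categorical aligned with the rows of `d` by position, as in the question.

```python
import numpy as np
import pandas as pd


def single_pass(d, dgs, grps):
    """Per-(group, value) means computed with two flat groupbys and no apply."""
    n_rows, n_cols = d.shape
    values = d.to_numpy().ravel()                   # cell values, row-major
    row_id = np.repeat(np.arange(n_rows), n_cols)   # row position of each cell
    looked_up = np.tile(dgs[:n_cols], n_rows)       # dgs entry for each cell's column

    # Step 1: mean of the looked-up numbers per (row, value).
    per_row = (
        pd.DataFrame({"row": row_id, "value": values, "dg": looked_up})
        .groupby(["row", "value"], as_index=False)["dg"]
        .mean()
    )

    # Step 2: attach each row's group code (-1 means the row fell outside
    # every bin), drop those rows, then average per (group, value).
    per_row["grp"] = np.asarray(grps.codes)[per_row["row"].to_numpy()]
    per_row = per_row[per_row["grp"] >= 0]
    out = per_row.groupby(["grp", "value"])["dg"].mean().unstack("value")

    # Turn the integer codes back into the interval labels.
    out.index = grps.categories.take(out.index.to_numpy())
    return out


rng = np.random.default_rng(42)
n_rows = 1000
d = pd.DataFrame(rng.integers(1, 10, (n_rows, 8)))
dgs = np.array([3, 4, 1, 8, 9, 2, 3, 7, 10, 8])
grps = pd.cut(rng.integers(1, 5, n_rows), np.arange(1, 5))
q = single_pass(d, dgs, grps)
print(q)
```

The long-format table holds one entry per cell, so at 10**7 rows of 8 columns it would contain 8 * 10**7 entries per column. Whether that fits in memory depends on the machine, and processing the rows in chunks may be needed.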