How to vectorize (using pandas / numpy) instead of using nested for loops

Time: 2018-06-20 14:44:39

Tags: python pandas numpy vectorization

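I would like to use pandas (or numpy) efficiently, in place of nested for loops with an if statement, to solve a particular problem. Here is a toy version:

Suppose I have the following two DataFrames: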

import pandas as pd
import numpy as np

dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)

dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)

Now I want to iterate over every row of each DataFrame and multiply the vals whenever a particular condition is met. This code does what I need:

ans = []

for i in range(len(df1)):
    for j in range(len(df2)):
        if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
            ans.append(df1['vals'][i]*df2['vals'][j])

np.sum(ans)

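However, this is clearly very inefficient, and in practice my DataFrames may have millions of entries, which makes it unusable. It also makes no use of the efficient vectorized implementations in pandas or numpy. Does anyone know how to vectorize this nested loop efficiently?

I feel this code is similar to matrix multiplication, so could I make progress using outer? The part I find hard is working in the if condition, because the if logic needs to compare every entry of df1 against all entries of df2.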

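4 answers:

Answer 0 (score: 4)

You can also use a compiler such as Numba for this. In the timings below it outperforms the vectorized solution, and it does not need temporary arrays.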

Example

import numba as nb
import numpy as np
import pandas as pd
import time

@nb.njit(fastmath=True,parallel=True,error_model='numpy')
def your_function(df1_in,df1_out,df1_vals,df2_in,df2_out,df2_vals):
  sum=0.
  for i in nb.prange(len(df1_in)):
      for j in range(len(df2_in)):
          if (df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]):
              sum+=df1_vals[i]*df2_vals[j]
  return sum

Test

dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)

#first call has some compilation overhead
res=your_function(df1['in'].values,df1['out'].values,df1['vals'].values,df2['in'].values,df2['out'].values,df2['vals'].values)

t1=time.time()
for i in range(1000):
  res=your_function(df1['in'].values,df1['out'].values,df1['vals'].values,df2['in'].values,df2['out'].values,df2['vals'].values)
  #res_2=g(df1, df2)

print(time.time()-t1)

Timings

vectorized solution @AGN Gazer: 9.15ms
parallelized Numba Version: 0.7ms

Answer 1 (score: 3)

m1 = np.less_equal.outer(df1['in'], df2['out']) 
m2 = np.greater_equal.outer(df1['out'], df2['in'])
m = np.logical_and(m1, m2)
v12 = np.outer(df1['vals'], df2['vals'])
print(v12[m].sum())

Alternatively, collapse the first three lines into one long line:

m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
s = np.outer(df1['vals'], df2['vals'])[m].sum()
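
If materializing the full outer product of the two vals columns is a concern, the same sum can also be written as a matrix product with the boolean mask. This is only a sketch on the question's toy data; the function name masked_sum is made up for illustration, and the mask itself still has len(df1) x len(df2) entries.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'vals': [100, 200], 'in': [0, 1], 'out': [1, 3]})
df2 = pd.DataFrame({'vals': [500, 800, 300, 200],
                    'in': [0.1, 0.5, 2, 4], 'out': [0.5, 2, 4, 5]})

def masked_sum(df1, df2):
    # Boolean overlap mask of shape (len(df1), len(df2)), built by broadcasting.
    m = ((df1['in'].values[:, None] <= df2['out'].values[None, :]) &
         (df1['out'].values[:, None] >= df2['in'].values[None, :]))
    # v1 @ (m @ v2) adds up v1[i] * v2[j] over the True entries of m,
    # without building the outer product of the two value vectors.
    return df1['vals'].values @ (m @ df2['vals'].values)

result = masked_sum(df1, df2)
print(result)
```

On the toy data this should agree with the nested loop from the question.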

For very large problems, dask is recommended.
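
The reason very large inputs are a problem is that the mask needs len(df1) x len(df2) entries at once. The sketch below is not dask; it is a plain NumPy illustration of the same idea of working block by block, so that only one slice of the mask exists at a time. The name chunked_sum and the chunk_size parameter are assumptions introduced here, not part of the answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'vals': [100, 200], 'in': [0, 1], 'out': [1, 3]})
df2 = pd.DataFrame({'vals': [500, 800, 300, 200],
                    'in': [0.1, 0.5, 2, 4], 'out': [0.5, 2, 4, 5]})

def chunked_sum(df1, df2, chunk_size=1000):
    # chunk_size is an illustrative tuning knob; pick it to fit available memory.
    in1, out1, v1 = (df1[c].values for c in ('in', 'out', 'vals'))
    in2, out2, v2 = (df2[c].values for c in ('in', 'out', 'vals'))
    total = 0.0
    for start in range(0, len(df1), chunk_size):
        stop = start + chunk_size
        # Mask for this block of df1 rows only: shape (<= chunk_size, len(df2)).
        m = ((in1[start:stop, None] <= out2[None, :]) &
             (out1[start:stop, None] >= in2[None, :]))
        # Sum of v1[i] * v2[j] over the True entries of this block.
        total += v1[start:stop] @ (m @ v2)
    return total

# chunk_size=1 forces two blocks on the toy data, just to exercise the loop.
result = chunked_sum(df1, df2, chunk_size=1)
print(result)
```

The total work is still proportional to len(df1) x len(df2); chunking only bounds the peak memory.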

Timing tests:

Here is a timing comparison using arrays of length 1000 and 1500:

In [166]: dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
     ...: df1 = pd.DataFrame(data=dict1)
     ...: 
     ...: dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
     ...: df2 = pd.DataFrame(data=dict2)

The asker's original approach (Python loops):

In [167]: def f(df1, df2):
     ...:     ans = []
     ...:     for i in range(len(df1)):
     ...:         for j in range(len(df2)):
     ...:             if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
     ...:                 ans.append(df1['vals'][i]*df2['vals'][j])
     ...:     return np.sum(ans)
     ...: 
     ...: 

In [168]: %timeit f(df1, df2)
47.3 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Ben.T's approach:

In [170]: %timeit df2['ans']= df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & (df1['out'] >= row['in'])].sum()*row['vals'],1); df2['ans'].sum()
2.22 s ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The vectorized solution proposed here:

In [171]: def g(df1, df2):
     ...:     m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
     ...:     return np.outer(df1['vals'], df2['vals'])[m].sum()
     ...: 
     ...: 

In [172]: %timeit g(df1, df2)

7.81 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
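
All of the approaches timed above still do work proportional to len(df1) x len(df2), which remains heavy for millions of rows. If every row satisfies in <= out (true for the question's toy data, but not for the random benchmark data above, and not something the question states), the sum can instead be obtained by sorting and prefix sums. This is a different technique from the answer's, offered only as a sketch under that assumption; the name sorted_sum is made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'vals': [100, 200], 'in': [0, 1], 'out': [1, 3]})
df2 = pd.DataFrame({'vals': [500, 800, 300, 200],
                    'in': [0.1, 0.5, 2, 4], 'out': [0.5, 2, 4, 5]})

def sorted_sum(df1, df2):
    # ASSUMPTION: 'in' <= 'out' on every row of both frames.
    in1, out1, v1 = (df1[c].values for c in ('in', 'out', 'vals'))
    in2, out2, v2 = (df2[c].values for c in ('in', 'out', 'vals'))

    # Prefix sums of df2 vals, ordered by 'out' and by 'in' respectively.
    by_out = np.argsort(out2)
    by_in = np.argsort(in2)
    cum_out = np.concatenate(([0], np.cumsum(v2[by_out])))
    cum_in = np.concatenate(([0], np.cumsum(v2[by_in])))
    total2 = v2.sum()

    # Sum of df2 vals that end strictly before df1 row i starts (out2 < in1[i]).
    ends_before = cum_out[np.searchsorted(out2[by_out], in1, side='left')]
    # Sum of df2 vals that start strictly after df1 row i ends (in2 > out1[i]).
    starts_after = total2 - cum_in[np.searchsorted(in2[by_in], out1, side='right')]

    # With valid intervals the two cases cannot both hold, so the rest overlaps row i.
    return (v1 * (total2 - ends_before - starts_after)).sum()

result = sorted_sum(df1, df2)
print(result)
```

The cost is dominated by the two sorts and the binary searches rather than by the number of row pairs. If some rows have in > out, the two excluded cases can overlap and this shortcut no longer matches the nested loop.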

Answer 2 (score: 2)

Your approach:

471 µs ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 1 (more than 3x slower):

df1.apply(lambda row: list((df2['vals'][(row['in'] <= df2['out']) & (row['out'] >= df2['in'])] * row['vals'])), axis=1).sum()

1.56 ms ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 2 (more than 2x slower):

ans = []
for name, row in df1.iterrows():
    _in = row['in']
    _out = row['out']
    _vals = row['vals']
    ans.append(df2['vals'].loc[(df2['in'] <= _out) & (df2['out'] >= _in)].values * _vals)

1.01 ms ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 3 (more than 3x faster):

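df1_vals = df1[['in', 'out', 'vals']].values  # fix the column order that the indices below rely on
df2_base = df2[['in', 'out', 'vals']].values
ans = np.zeros(shape=(len(df1_vals), len(df2_base)))
for i in range(df1_vals.shape[0]):
    df2_vals = df2_base.copy()  # fresh copy so zeroed entries do not carry over between rows
    # zero out the df2 vals whose interval does not overlap row i of df1
    df2_vals[:, 2][~np.logical_and(df1_vals[i, 1] >= df2_vals[:, 0], df1_vals[i, 0] <= df2_vals[:, 1])] = 0
    ans[i, :] = df2_vals[:, 2] * df1_vals[i, 2]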

144 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

With Method 3, you can view the solution by doing the following:

ans[ans.nonzero()]

Out[]: array([ 50000.,  80000., 160000.,  60000.])

I couldn't think of a way to eliminate the underlying loop :( but I learned a lot about numpy along the way! (learning is good)
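
For what it's worth, the remaining loop in Method 3 can probably be removed with broadcasting: compare a column vector built from df1 against a row vector built from df2, then keep the products only where the condition holds. A minimal sketch on the question's toy data, with column selection by name rather than by position (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'vals': [100, 200], 'in': [0, 1], 'out': [1, 3]})
df2 = pd.DataFrame({'vals': [500, 800, 300, 200],
                    'in': [0.1, 0.5, 2, 4], 'out': [0.5, 2, 4, 5]})

in1, out1, v1 = (df1[c].values for c in ('in', 'out', 'vals'))
in2, out2, v2 = (df2[c].values for c in ('in', 'out', 'vals'))

# mask[i, j] is True when row i of df1 overlaps row j of df2.
mask = (in1[:, None] <= out2[None, :]) & (out1[:, None] >= in2[None, :])

# Same (len(df1), len(df2)) layout as Method 3: product where the mask holds, 0 elsewhere.
ans = np.where(mask, v1[:, None] * v2[None, :], 0.0)

print(ans[ans.nonzero()])  # the same four products as Method 3
print(ans.sum())
```

This trades the Python loop for a temporary array of len(df1) x len(df2) entries, so it suits moderate sizes rather than millions of rows.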

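Answer 3 (score: -1)

One way is to use apply. Create a column in df2 that holds the sum of the vals in df1 satisfying your in and out conditions, multiplied by the vals of that df2 row.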

df2['ans']= df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & 
                                              (df1['out'] >= row['in'])].sum()*row['vals'],1)

Then sum this column:

df2['ans'].sum()