Multiply two pandas DataFrames based on index and columns

Time: 2021-02-08 15:16:21

Tags: python pandas dataframe optimization

I have two DataFrames with the following structure:

id           0          1         2           
time      0    1     0    1   0     1            
id time                                            id    0      1     2                 
0  0      a1   a2    b1   b2   c1   c2             id 
   1      a3   a4    b3   b4   c3   c4             0     w00   w01   w02 
1  0      d1   d2    d1   d2   e1   e2     and     1     w10   w11   w12  
   1      d3   d4    d3   d4   e3   e4             2     w20   w21   w22  
2  0      f1   f2    g1   g2   h1   h2            
   1      f3   f4    g3   g4   h3   h4            

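I need to obtain a sequence of matrices in which every element of the first DataFrame's block for a given (row id, column id) pair is multiplied by the element of the second DataFrame at that same (row id, column id) position, i.e.: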

id               0           |    id                1           |    id              2  
time         0       1       |    time          0       1       |    time        0       1 
id time                      |    id time                       |    id time
0  0      a1*w00   a2*w00    |    0  0       b1*w01   b2*w01    |    0  0     c1*w02   c2*w02
   1      a3*w00   a4*w00    |       1       b3*w01   b4*w01    |       1     c3*w02   c4*w02

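And so on. My current implementation is shown below, but it takes a very long time even with a sample size of only 200 and 3 time periods (and I need to repeat it hundreds of times), so I am wondering whether there is a way to vectorize/optimize this. I don't know whether it matters, but the end goal is to sum all the elements of each resulting matrix.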

import numpy as np
import pandas as pd

N = 3
T = 2
NT = N*T

# JUST GENERATING FAKE DATA
ind = []
for i in range(N):
    for t in range(T):
        i_t = (i,t)
        ind.append(i_t)
        
index2 = pd.MultiIndex.from_tuples(ind)

eps1 = np.random.randint(1,10,(NT,1))
eps2 = np.random.randint(1,10,(NT,1))
df1 = pd.DataFrame(eps1.dot(eps2.transpose()), index=index2, columns=index2)

w = np.random.normal(0, 1, size=(N,1))
df2 = pd.DataFrame(w.dot(w.transpose()))

E = pd.DataFrame(index=range(N), columns=range(N))

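# THIS IS WHAT I NEED TO VECTORIZE/OPTIMIZE
for i in range(N):
    for j in range(N):
        # Scale the (i, j) block of df1 by the scalar df2[i][j], then sum its elements.
        # E.loc[i, j] avoids chained assignment, which may silently write to a copy.
        E.loc[i, j] = (df1.loc[i][j] * df2.loc[i][j]).to_numpy().sum()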
        
E

1 answer:

Answer 0 (score: 0)

Try:

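(df1.mul(df2, level=0)            # multiply, aligning df2 with level 0 of df1's rows and columns
    .groupby(level=0).sum()       # sum the rows within each level-0 id
    .T.groupby(level=0).sum().T   # sum the columns within each level-0 id
)
# On pandas < 2.0 the last two steps could be written as
# .sum(level=0).sum(axis=1, level=0); the level argument of sum was removed in 2.0.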
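
Since `df2[i][j]` is constant within each block, the target is just `df2[i][j]` times the sum of the (i, j) block of `df1`, so a plain NumPy reshape may also do the job. The following is only a sketch, assuming that the rows and columns of `df1` are sorted by id and that every id has exactly `T` time periods (true for the sample data, not confirmed for the real data). The names `block_sums`, `E_fast` and `E_loop` are made up for illustration.

```python
import numpy as np
import pandas as pd

N, T = 3, 2

# Fake data with the same layout as in the question
index2 = pd.MultiIndex.from_product([range(N), range(T)])
rng = np.random.default_rng(0)
eps1 = rng.integers(1, 10, (N * T, 1))
eps2 = rng.integers(1, 10, (N * T, 1))
df1 = pd.DataFrame(eps1 @ eps2.T, index=index2, columns=index2)
w = rng.normal(0, 1, size=(N, 1))
df2 = pd.DataFrame(w @ w.T)

# Assumption: df1 is sorted by id and each id has exactly T periods,
# so both axes can be split into (id, time) and the time axes summed out.
block_sums = df1.to_numpy().reshape(N, T, N, T).sum(axis=(1, 3))

# Each block sum is scaled by the matching weight from df2
E_fast = pd.DataFrame(block_sums * df2.to_numpy(),
                      index=df2.index, columns=df2.columns)

# Reference result from the original double loop, for comparison only
E_loop = np.array([[(df1.loc[i][j] * df2.loc[i][j]).to_numpy().sum()
                    for j in range(N)]
                   for i in range(N)])
```

If the ids are not sorted or the panel is unbalanced, the reshape will silently mix blocks, so the label-based pandas version is the safer choice in that case.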