I need to independently shuffle each row of a large pandas DataFrame (typical shape (10000, 1000)) many times, and then estimate the correlation of each row with a given series.
The most efficient (= fastest) way I have found while staying within pandas is the following:
for i in range(N):  # the larger N is, the better
    df_sh = df.apply(numpy.random.permutation, axis=1)
    # where df is my large dataframe, with 10K rows and 1K columns
    corr = df_sh.corrwith(s, axis=1)
    # where s is the provided series (shape of s = (1000,))
The two tasks take approximately the same amount of time (i.e., 30 seconds each). I tried converting my dataframe to a numpy.array, running a for loop over the array, and for each row first performing the permutation and then measuring the correlation with scipy.stats.pearsonr. Unfortunately, I only managed to speed up my two tasks by a factor of 2.
Are there other viable options to speed up the tasks? (Note: I have already parallelized the execution of my code with Joblib, up to the maximum factor allowed by the machine I am using.)
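For concreteness, a minimal sketch of that NumPy-loop attempt (the variable names here are illustrative, not my exact code):

import numpy as np
from scipy.stats import pearsonr

arr = df.values.copy()           # copy, since the row shuffle below is in-place
s_vals = s.values
corrs = np.empty(arr.shape[0])
for row in range(arr.shape[0]):
    np.random.shuffle(arr[row])                  # permute one row in place
    corrs[row], _ = pearsonr(arr[row], s_vals)   # correlation with the series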
Answer 0 (score: 2)
We can adapt corr2_coeff_rowwise to get the correlation between a 2D array/matrix and a 1D array/vector, like so -
import numpy as np

def corr2_coeff_2d_1d(A, B):
    # Row-wise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1, keepdims=1)
    B_mB = B - B.mean()

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = B_mB.dot(B_mB)

    # Finally get corr coeff
    return A_mA.dot(B_mB) / np.sqrt(ssA * ssB)
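A quick way to verify the function is to compare it against np.corrcoef on small random inputs (the test arrays here are arbitrary):

A_test = np.random.rand(5, 8)
B_test = np.random.rand(8)
expected = np.array([np.corrcoef(row, B_test)[0, 1] for row in A_test])
assert np.allclose(corr2_coeff_2d_1d(A_test, B_test), expected)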
To shuffle each row, and to do so for all rows, we could use np.random.shuffle. Now, this shuffle function works along the first axis, so to apply it to our rows we would need to feed in the transposed version. Also, note that this shuffling is done in-place, so if the original dataframe is needed elsewhere, make a copy before processing.
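Since those two points matter for correctness, here is a tiny illustration of them (a demonstration added for clarity, not code from the original answer):

x = np.arange(12).reshape(4, 3)
np.random.shuffle(x)     # reorders the rows of x, in place; returns None
np.random.shuffle(x.T)   # reorders the columns of x (rows of its transpose), in place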
An alternative that avoids the in-place issue, and that gives each row its own independent permutation, is to build shuffled indices per row and apply them with advanced indexing. So, let's use that to solve our case -
# Extract underlying array data for faster NumPy processing in the loop later on
a = df.values
s_ar = s.values

# Setup array for row-indexing with NumPy's advanced indexing later on
r = np.arange(a.shape[0])[:,None]

for i in range(N):
    # Get shuffled indices per row with the `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    corr = corr2_coeff_2d_1d(shuffled_a, s_ar)
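Why the rand+argsort trick fits here: argsort over a matrix of i.i.d. uniform draws produces an independent, uniformly random permutation for every row in a single vectorized call, and because advanced indexing returns a new array, a itself is left untouched across iterations (unlike np.random.shuffle, which modifies its input in place).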
Now, we can pre-compute the parts that stay the same across iterations: everything involving the series, and also the row means of A, since shuffling the elements within a row does not change that row's mean. A further optimized version would then look like this -
a = df.values
s_ar = s.values
r = np.arange(a.shape[0])[:,None]

B = s_ar
B_mB = B - B.mean()
ssB = B_mB.dot(B_mB)

A = a
A_mean = A.mean(1, keepdims=1)

for i in range(N):
    # Get shuffled indices per row with the `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    A = shuffled_a
    A_mA = A - A_mean
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    corr = A_mA.dot(B_mB) / np.sqrt(ssA * ssB)
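To confirm that hoisting these pieces out of the loop does not change the result, we can check that both variants agree on the same shuffled array (a quick sketch reusing the names defined above):

idx = np.random.rand(*a.shape).argsort(1)
shuffled_a = a[r, idx]

ref = corr2_coeff_2d_1d(shuffled_a, s_ar)

A_mA = shuffled_a - A_mean   # valid: a row's mean is unchanged by a within-row shuffle
ssA = np.einsum('ij,ij->i', A_mA, A_mA)
opt = A_mA.dot(B_mB) / np.sqrt(ssA * ssB)

assert np.allclose(ref, opt)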
Setting up inputs with the actual use-case shapes/sizes -
In [302]: df = pd.DataFrame(np.random.rand(10000,1000))
In [303]: s = pd.Series(df.iloc[0])
1. Original approach
In [304]: %%timeit
...: df_sh = df.apply(np.random.permutation, axis=1)
...: corr = df_sh.corrwith(s, axis = 1)
1 loop, best of 3: 1.99 s per loop
<强> 2。建议的方法
The pre-processing part (done only once before starting the loop, hence not included in the timings) -
In [305]: a = df.values
...: s_ar = s.values
...: r = np.arange(a.shape[0])[:,None]
...:
...: B = s_ar
...: B_mB = B - B.mean()
...: ssB = B_mB.dot(B_mB)
...:
...: A = a
...: A_mean = A.mean(1,keepdims=1)
The part of the proposed solution that runs inside the loop -
In [306]: %%timeit
...: idx = np.random.rand(*a.shape).argsort(1)
...: shuffled_a = a[r, idx]
...:
...: A = shuffled_a
...: A_mA = A - A_mean
...: ssA = np.einsum('ij,ij->i',A_mA,A_mA)
...: corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
1 loop, best of 3: 675 ms per loop
So, we are seeing roughly a 3x speedup there (1.99 s down to 675 ms per loop iteration)!