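I want to add values to slices of a pandas DataFrame efficiently, because this function is called very often. The structure looks like this: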
import pandas as pd
import numpy as np
names = ["a", "b", "c", "d", "e", "f"]
mat = pd.DataFrame(0.0, index=names, columns=names)
# now comes the `tricky' part
positive_instances = ["a", "e", "c"]
negative_instances = ["d", "b", "f"]
p_mat = np.array([[1.,2.],[3.,4.]])
mat.loc[positive_instances, positive_instances] += p_mat[0,0]
mat.loc[positive_instances, negative_instances] += p_mat[0,1]
mat.loc[negative_instances, positive_instances] += p_mat[1,0]
mat.loc[negative_instances, negative_instances] += p_mat[1,1]
The desired resulting matrix mat looks like this:
mat =
  a b c d e f
a 1 2 1 2 1 2
b 3 4 3 4 3 4
c 1 2 1 2 1 2
d 3 4 3 4 3 4
e 1 2 1 2 1 2
f 3 4 3 4 3 4
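Notes: the structure shown above is embedded in a for loop, and there are several different sets of positive and negative instances. Some additional details about the data structures:
positive_instances and negative_instances are always disjoint and do not need to have the same length.
The union of positive_instances and negative_instances is always names.
positive_instances always corresponds to index 0 of p_mat, while negative_instances always corresponds to index 1.
I think there is a more efficient way to achieve this. Any help would be appreciated.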
Edit: corrected the variable names in the code and added the desired output.
Edit 2: added information about positive_instances and negative_instances.
Answer 0 (score: 2)
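We can use NumPy here: broadcasted indexing with np.ix_ assigns values into an array efficiently and mimics the behavior of .loc[row, col] in pandas. Once the assignments are done, we create the output DataFrame from the array.
Thus, the implementation would look something like this: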
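# Sort order of names, so that searchsorted also works when names is unsorted
sidx = np.argsort(names)
# Integer positions of the positive and negative labels within names
p_idx = sidx[np.searchsorted(names, positive_instances, sorter=sidx)]
n_idx = sidx[np.searchsorted(names, negative_instances, sorter=sidx)]
n = len(names)
arr = np.zeros((n, n), dtype=p_mat.dtype)
# Fill the four blocks; plain assignment is enough because arr starts at zero
arr[np.ix_(p_idx, p_idx)] = p_mat[0, 0]
arr[np.ix_(p_idx, n_idx)] = p_mat[0, 1]
arr[np.ix_(n_idx, p_idx)] = p_mat[1, 0]
arr[np.ix_(n_idx, n_idx)] = p_mat[1, 1]
df = pd.DataFrame(arr, index=names, columns=names)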
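To illustrate why np.ix_ behaves like .loc[rows, cols], here is a small sketch on a toy array. The array size and the index values are made up for the demonstration and are not part of the original problem.

```python
import numpy as np

arr = np.zeros((4, 4))
rows = np.array([0, 2])   # illustrative row positions
cols = np.array([1, 3])   # illustrative column positions

# np.ix_ returns a (2, 1) row index and a (1, 2) column index;
# together they broadcast to the full 2x2 block rows x cols.
r, c = np.ix_(rows, cols)
print(r.shape, c.shape)

# Updates all four cells (0,1), (0,3), (2,1), (2,3), like .loc[rows, cols]
arr[np.ix_(rows, cols)] += 5.0
print(arr)

# By contrast, plain arr[rows, cols] pairs the indices elementwise
# and would only touch (0, 1) and (2, 3).
```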
Runtime test
Approaches:
def func0(p_mat, names, positive_instances, negative_instances):
    mat = pd.DataFrame(0.0, index=names, columns=names)
    mat.loc[positive_instances, positive_instances] += p_mat[0,0]
    mat.loc[positive_instances, negative_instances] += p_mat[0,1]
    mat.loc[negative_instances, positive_instances] += p_mat[1,0]
    mat.loc[negative_instances, negative_instances] += p_mat[1,1]
    return mat

def func1(p_mat, names, positive_instances, negative_instances):
    sidx = np.argsort(names)
    p_idx = sidx[np.searchsorted(names, positive_instances, sorter=sidx)]
    n_idx = sidx[np.searchsorted(names, negative_instances, sorter=sidx)]
    n = len(names)
    arr = np.zeros((n,n),dtype=p_mat.dtype)
    arr[np.ix_(p_idx, p_idx)] = p_mat[0,0]
    arr[np.ix_(p_idx, n_idx)] = p_mat[0,1]
    arr[np.ix_(n_idx, p_idx)] = p_mat[1,0]
    arr[np.ix_(n_idx, n_idx)] = p_mat[1,1]
    df = pd.DataFrame(arr, index=names, columns=names)
    return df
Timings:
In [109]: names = ["a", "f", "d","b", "c", "e"]
...:
...: # now comes the `tricky' part
...: positive_instances = ["a", "e", "c"]
...: negative_instances = ["d", "b", "f"]
...:
...: p_mat = np.array([[1.,2.],[3.,4.]])
...:
In [110]: %timeit func0(p_mat, names, positive_instances, negative_instances)
100 loops, best of 3: 4.87 ms per loop
In [111]: %timeit func1(p_mat, names, positive_instances, negative_instances)
10000 loops, best of 3: 189 µs per loop
In [112]: 4870.0/189
Out[112]: 25.767195767195766
That is a 25x+ speedup!
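Since the question mentions that the update runs inside a for loop with several different positive/negative splits, one possible variation is to accumulate into a single NumPy array and build the DataFrame only once at the end. The sketch below is not from the answer above: the splits list is hypothetical, and the label lookup relies on the stated assumption that the two lists are disjoint and together cover names.

```python
import numpy as np
import pandas as pd

names = ["a", "f", "d", "b", "c", "e"]
p_mat = np.array([[1., 2.], [3., 4.]])

# Hypothetical splits; each pair is (positive_instances, negative_instances).
splits = [
    (["a", "e", "c"], ["d", "b", "f"]),
    (["a", "b"], ["c", "d", "e", "f"]),
]

names_arr = np.asarray(names)
acc = np.zeros((len(names), len(names)), dtype=p_mat.dtype)

for positive_instances, _negative_instances in splits:
    # 0 for positive names, 1 for all others (assumes the two lists are
    # disjoint and their union is names, as stated in the question).
    labels = (~np.isin(names_arr, positive_instances)).astype(int)
    # Look up p_mat[row_label, col_label] for every cell at once and accumulate.
    acc += p_mat[labels[:, None], labels]

# Build the DataFrame once, after the loop.
mat = pd.DataFrame(acc, index=names, columns=names)
print(mat)
```

Keeping the running total in a plain array avoids paying the pandas indexing overhead on every iteration; whether this helps in a given workload would need to be measured.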