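Pandas matrix calculation up to the diagonal

Time: 2020-06-24 10:31:31

Tags: python pandas

I am doing a matrix calculation using pandas in Python.

My raw data is a column of lists of strings (the strings within each row are unique).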

id     list_of_value
0      ['a','b','c']
1      ['d','b','c']
2      ['a','b','c']
3      ['a','b','c']

I have to calculate the score of each row against all other rows.

Score calculation algorithm:

Step 1: Take value of id 0: ['a','b','c'],
Step 2: find the intersection between id 0 and id 1 , 
        resultant = ['b','c']
Step 3: Score Calculation => resultant.size / id(0).size

Repeat steps 2 and 3 between id 0 and ids 1, 2, 3, and do the same for every other id.
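The steps above can be sketched in plain Python. This is only an illustration of the scoring rule, not the approach used in the question; the helper name `score` and the variable `rows` are made up for the example.

```python
def score(base, other):
    """Share of the distinct values in `base` that also occur in `other`."""
    base_set = set(base)
    # Step 2: intersection; Step 3: size of the intersection / size of the base row
    return len(base_set & set(other)) / len(base_set)


# sample data from the question, one list per id
rows = [['a', 'b', 'c'], ['d', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]

# full N * N matrix: entry [i][j] scores row i against row j
matrix = [[score(base, other) for other in rows] for base in rows]
```

With the sample data, every pair that involves id 1 shares only 'b' and 'c', so those entries are 2/3 and all others are 1.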

Create an N * N matrix:

-  0     1     2     3
0  1     0.67  1     1
1  0.67  1     0.67  0.67
2  1     0.67  1     1
3  1     0.67  1     1

Currently I am using the pandas dummies approach to calculate the score:

s = pd.get_dummies(df.list_of_value.explode()).sum(level=0)
s.dot(s.T).div(s.sum(1))
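A note for newer pandas versions: the `level` argument of `DataFrame.sum` was removed in pandas 2.0, so the first line is no longer accepted there. Below is a minimal sketch of the same calculation using `groupby(level=0).sum()`. The `df` construction is an assumption that mirrors the sample data, and the name `scores` is only for illustration.

```python
import pandas as pd

# sample data from the question (assumed layout: one list per row)
df = pd.DataFrame({"list_of_value": [['a', 'b', 'c'],
                                     ['d', 'b', 'c'],
                                     ['a', 'b', 'c'],
                                     ['a', 'b', 'c']]})

# one indicator column per distinct value, summed back to one row per original id
s = pd.get_dummies(df["list_of_value"].explode()).groupby(level=0).sum()

# s.dot(s.T) counts shared values per pair; dividing by the row sizes gives the score
scores = s.dot(s.T).div(s.sum(axis=1))
```

By default `div` aligns the Series with the columns, so column j is divided by the number of values in row j.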

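However, the calculation is repeated beyond the diagonal of the matrix; computing the scores up to the diagonal is sufficient. For example:

The score for ID 0 only needs to be calculated up to ID (row, column) (0,0). The values at ID (row, column) (0,1), (0,2), (0,3) can be copied from ID (row, column) (1,0), (2,0), (3,0).

Details of the calculation: [image: matrix sample] I need to calculate up to the diagonal, i.e. up to the yellow boxes (the diagonal of the matrix). The white values are already calculated in the green shaded area (shown for reference), so I only need to transpose the green shaded area into the white one.

How can I do this in pandas?

3 answers:

Answer 0: (score: 8)

First of all, here is a profile of your code: first each command separately, then the code as you posted it.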

%timeit df.list_of_value.explode()
%timeit pd.get_dummies(s)
%timeit s.sum(level=0)
%timeit s.dot(s.T)
%timeit s.sum(1)
%timeit s2.div(s3)

The profiling above returned the following results:

Explode   : 1000 loops, best of 3: 201 µs per loop
Dummies   : 1000 loops, best of 3: 697 µs per loop
Sum       : 1000 loops, best of 3: 1.36 ms per loop
Dot       : 1000 loops, best of 3: 453 µs per loop
Sum2      : 10000 loops, best of 3: 162 µs per loop
Divide    : 100 loops, best of 3: 1.81 ms per loop

Running your two lines together results in:

100 loops, best of 3: 5.35 ms per loop

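Using a different approach that relies less on pandas functionality (which is sometimes expensive), and skipping the calculation of the upper triangular matrix and the diagonal, the code I created takes only about one third of the time.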

import numpy as np
import pandas as pd

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))
for i in range(len(df)):
    d0 = set(df.iloc[i].list_of_value)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(df)):
        df2[j, i] = len(d0.intersection(df.iloc[j].list_of_value)) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(df))])

With df defined as

df = pd.DataFrame(
    [[['a','b','c']],
     [['d','b','c']],
     [['a','b','c']],
     [['a','b','c']]],
     columns = ["list_of_value"])

Profiling this code results in a running time of only 1.68 ms.

1000 loops, best of 3: 1.68 ms per loop
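The mirroring step in the code above can also be written with `np.triu_indices`, which returns the row and column indices of the upper triangle directly. A minimal sketch on a made-up 3 x 3 matrix (the values and the names `lower` and `full` are only for illustration). Note that mirroring is only valid if the score is symmetric, which holds when all lists contain the same number of distinct values, as in the sample data.

```python
import numpy as np

# lower triangle already computed, upper triangle still empty (illustrative values)
lower = np.array([[1.0,  0.0,  0.0],
                  [0.5,  1.0,  0.0],
                  [0.25, 0.75, 1.0]])

# indices of the strict upper triangle (k=1 skips the diagonal)
rows, cols = np.triu_indices(len(lower), k=1)

full = lower.copy()
# copy each value at (col, row) in the lower triangle to (row, col) in the upper one
full[rows, cols] = lower[cols, rows]
```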

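Update

Instead of operating on the entire DataFrame, picking only the Series that is needed gives a huge speedup.

Three ways of iterating over the entries of the Series have been tested, and they are more or less equal in terms of performance.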

%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))

# get the Series from the DataFrame
dfl = df.list_of_value

for i, d0 in enumerate(dfl.values):
# for i, d0 in dfl.items():  # (iteritems() before pandas 2.0) in terms of performance about equal to the line above
# for i in range(len(dfl)): # slightly less performant than enumerate(dfl.values)
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl.iloc[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])

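Pandas has many pitfalls. For example, always access the rows of a DataFrame or Series via df.iloc[0] instead of df[0]. Both work, but df.iloc[0] is much faster.

The timings for the first matrix with 4 elements, each of size 3, showed a speedup of about 3 times.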

1000 loops, best of 3: 443 µs per loop

When using a larger dataset, I got far better results, with a speedup of more than 11 times:

# operating on the DataFrame
10 loops, best of 3: 565 ms per loop

# operating on the Series
10 loops, best of 3: 47.7 ms per loop

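Update 2

When not using pandas at all during the calculation, you get another significant speedup. To do so, you simply need to convert the column you want to operate on into a list.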

%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# convert the column of the DataFrame to a list
dfl = list(df.list_of_value)

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(dfl), len(dfl)))

for i, d0 in enumerate(dfl):
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])

On the data provided in the question, we only see a slightly better result compared to the first update.

1000 loops, best of 3: 363 µs per loop

But when using bigger data (100 rows with lists of size 15), the advantage becomes obvious:

100 loops, best of 3: 5.26 ms per loop

Here is a comparison of all the suggested methods:

+----------+-----------------------------------------+
|          | Using the Dataset from the question     |
+----------+-----------------------------------------+
| Question | 100 loops, best of 3: 4.63 ms per loop  |
+----------+-----------------------------------------+
| Answer   | 1000 loops, best of 3: 1.59 ms per loop |
+----------+-----------------------------------------+
| Update 1 | 1000 loops, best of 3: 447 µs per loop  |
+----------+-----------------------------------------+
| Update 2 | 1000 loops, best of 3: 362 µs per loop  |
+----------+-----------------------------------------+
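To reproduce timings like these outside IPython, the standard `timeit` module can be used. The sketch below wraps the Update 2 loop in a hypothetical function `score_matrix` (it writes both triangles in the inner loop instead of mirroring afterwards) and times it on random data shaped like the larger dataset. Because `random.choices` draws with replacement, the lists may contain duplicates, so the mirrored values are only an approximation of the per-row score here. Absolute numbers will differ by machine.

```python
import random
import timeit

import numpy as np


def score_matrix(lists):
    """Pairwise scores; computes each pair once and writes both triangles."""
    n = len(lists)
    out = np.ones((n, n))  # the diagonal is already 1
    for i, d0 in enumerate(lists):
        d0 = set(d0)
        for j in range(i + 1, n):
            out[j, i] = out[i, j] = len(d0.intersection(lists[j])) / len(d0)
    return out


random.seed(0)  # fixed seed so repeated runs use the same data
data = [random.choices("abcdefghijklmnopqrstuvwxyz", k=15) for _ in range(100)]

# average wall-clock time over 10 calls
elapsed = timeit.timeit(lambda: score_matrix(data), number=10) / 10
print(f"{elapsed * 1e3:.2f} ms per call")
```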

Answer 1: (score: 3)

Although this question has been well answered, I will show a more readable and also efficient alternative:

from itertools import product
len_df = df.shape[0]
values = tuple(map(lambda comb: np.isin(*comb).sum() / len(comb[0]),
         product(df['list_of_value'], repeat=2)))

pd.DataFrame(index=df['id'],
             columns=df['id'],
             data=np.array(values).reshape(len_df, len_df))

id         0         1         2         3
id                                        
0   1.000000  0.666667  1.000000  1.000000
1   0.666667  1.000000  0.666667  0.666667
2   1.000000  0.666667  1.000000  1.000000
3   1.000000  0.666667  1.000000  1.000000

%%timeit
len_df = df.shape[0]
values = tuple(map(lambda comb: np.isin(*comb).sum() / len(comb[0]),
         product(df['list_of_value'], repeat=2)))

pd.DataFrame(index=df['id'],
             columns=df['id'],
             data=np.array(values).reshape(len_df, len_df))

850 µs ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
#convert the column of the DataFrame to a list
dfl = list(df.list_of_value)

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(dfl), len(dfl)))

for i, d0 in enumerate(dfl):
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])

470 µs ± 79.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
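The `product(..., repeat=2)` call above evaluates all N * N pairs. As a hedged variation, not part of the original answer, `itertools.combinations` visits each unordered pair only once, which matches the "up to the diagonal" idea from the question. The sketch assumes the values within each list are unique, so the number of shared values is the same in both directions. The names `lists` and `scores` are illustrative.

```python
from itertools import combinations

import numpy as np

lists = [['a', 'b', 'c'], ['d', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]
n = len(lists)
scores = np.ones((n, n))  # the diagonal is already 1

for i, j in combinations(range(n), 2):
    # np.isin flags the elements of lists[i] that also occur in lists[j]
    shared = np.isin(lists[i], lists[j]).sum()
    # one intersection count, two denominators: works for lists of unequal length
    scores[i, j] = shared / len(lists[i])
    scores[j, i] = shared / len(lists[j])
```

Dividing by each row's own length keeps the result correct even when the lists differ in size, where simply mirroring one triangle would not be.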

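Answer 2: (score: 2)

I'm not inclined to change your first line, although I'm sure it could be faster, because it is not going to be the bottleneck as your data gets larger. But the second line could be, and it is also extremely easy to improve:

Change this: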

s.dot(s.T).div(s.sum(1))

To this:

arr=s.values
np.dot( arr, arr.T ) / arr[0].sum()

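This just does the calculation in numpy rather than pandas, but you will often get a huge speedup that way. On your small sample data it only speeds things up by 2x, but if I increase the dataframe from 4 rows to 400 rows, I see a speedup of over 20x.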
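One caveat: dividing by `arr[0].sum()` uses the size of the first row for every entry, so it matches the pandas version only when all rows contain the same number of values. A minimal sketch of a variant that divides by each row's own size, as `.div(s.sum(1))` does; the indicator matrix `arr` below is made up for illustration, with the last row deliberately shorter.

```python
import numpy as np

# indicator matrix: columns stand for a, b, c, d; the last row holds only a and b
arr = np.array([[1, 1, 1, 0],
                [0, 1, 1, 1],
                [1, 1, 1, 0],
                [1, 1, 0, 0]])

# broadcasting divides column j by the number of values in row j,
# which is what DataFrame.div with a Series does by default
scores = (arr @ arr.T) / arr.sum(axis=1)
```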

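As an aside, I would be inclined not to worry about the triangular aspect of the problem, at least as far as speed is concerned. You would have to make the code considerably more complex, and you probably would not even gain any speed in a situation like this.

Conversely, if conserving storage space is important, then obviously retaining only the upper (or lower) triangle will cut your storage needs by slightly more than half.

(If you really do care about the triangular aspect, numpy does have related functions/methods, but I don't know them offhand and, again, it is not clear to me whether it is worth the extra complexity in this case.)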
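For reference, the numpy functions in question include `np.tril_indices` and `np.triu_indices`. A minimal sketch of the storage idea, assuming a symmetric score matrix: keep only the lower triangle (including the diagonal) as a flat array and rebuild the full matrix on demand. The matrix values and the names `packed` and `restored` are illustrative.

```python
import numpy as np

# a small symmetric score matrix (illustrative values)
full = np.array([[1.0,  0.5,  0.25],
                 [0.5,  1.0,  0.75],
                 [0.25, 0.75, 1.0]])

# row/column indices of the lower triangle, diagonal included
rows, cols = np.tril_indices(len(full))

# flat storage: n * (n + 1) / 2 values instead of n * n
packed = full[rows, cols]

# rebuild the full matrix by writing the packed values to both triangles
restored = np.zeros_like(full)
restored[rows, cols] = packed
restored[cols, rows] = packed
```

Dropping the diagonal as well (with `k=-1`) brings the stored size down to n * (n - 1) / 2, since the diagonal is always 1 here.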