How to efficiently get a kind of "maximum" in a matrix

Time: 2019-03-06 23:33:11

Tags: python-3.x pandas performance matrix iteration

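I have the following problem: I opened a matrix with the pandas module, in which every cell holds a number between -1 and 1. What I want to find, for each row, is the largest "possible" value that is not also the maximum of another row.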

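For example, if two rows have their maximum in the same column, I compare the two values and keep the larger one. Then, for the row whose maximum is smaller than the other row's, I take its second-largest value (and repeat the same analysis again and again).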

To explain myself, it is best to look at my code:

import numpy as np
import pandas as pd

matrix = pd.read_csv("matrix.csv")
# this matrix has an id (or name) for each column
# ... and the first column has the id of each row
results = pd.DataFrame(np.empty((len(matrix), 3), dtype=object), columns=['id1', 'id2', 'max_pos'])

l = len(matrix.columns)  # number of columns

again = True  # renamed from `next`, which shadows a Python built-in
while again:
    again = False
    for i in range(len(matrix)):
        max_column = str(1)  # start at column '1' because column '0' holds the row ids
        for j in range(2, l):
            if matrix.loc[i, max_column] < matrix.loc[i, str(j)]:
                max_column = str(j)
        results.loc[i, 'id1'] = str(i)  # I could also put matrix['0'][i] here
        results.loc[i, 'id2'] = max_column
        results.loc[i, 'max_pos'] = matrix.loc[i, max_column]

    for i in range(len(results)):  # now check whether two or more rows have the same max column
        for ii in range(len(results)):
            # if two id1 have their max in the same column, keep the one with the bigger
            # ... max value and change the other to -1 to iterate again
            if (results.loc[i, 'id2'] == results.loc[ii, 'id2']) and (results.loc[i, 'max_pos'] < results.loc[ii, 'max_pos']):
                matrix.loc[i, results.loc[i, 'id2']] = -1
                again = True

An example:

#consider
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[4, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})

   a  b  c  d
0  1  4  3  1
1  2  5  3  0
2  5  1  4  0
3  0  0  2  1

#at the first iteration I will have the following result

0  b  4 # this means that the row 0 has its maximum at column 'b' and its value is 4
1  b  5
2  a  5
3  c  2

#the problem is that column b is the maximum of row 0 and 1, but I know that the maximum of row 1 is bigger than row 0, so I take the second maximum of row 0, then:

0  c  3
1  b  5
2  a  5
3  c  2

#now I solved the problem for row 0 and 1, but I have that the column c is the maximum of row 0 and 3, so I compare them and take the second maximum in row 3 

0  c  3
1  b  5
2  a  5
3  d  1

#now I'm done. If two rows have their maximum in the same column and the values are also equal, nothing happens and I keep those values.

#what if the matrix were
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[5, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})

   a  b  c  d
0  1  5  3  1
1  2  5  3  0
2  5  1  4  0
3  0  0  2  1

#then, at the first iteration the result will be:

0  b  5
1  b  5
2  a  5
3  c  2

#then, given that the max value of row 0 and 1 is at the same column, I should compare the maximum values
# ... but in this case the values are the same (both are 5), this would be the end of iterating 
# ... because I can't choose between row 0 and 1 and the other rows have their maximum at different columns...
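
The procedure walked through above can be sketched with NumPy array operations instead of nested Python loops. This is only one possible reading of the rules in the example, not the asker's code: the function name `resolve_row_maxima` is made up for illustration, it assumes the id column has already been removed, and it marks a lost cell with `-inf` rather than `-1`.

```python
import numpy as np
import pandas as pd

def resolve_row_maxima(df):
    """Hypothetical helper: for each row, pick its largest value whose column
    is not claimed by another row with a strictly larger value."""
    values = df.to_numpy(dtype=float).copy()
    rows = np.arange(len(values))
    while True:
        best_col = values.argmax(axis=1)       # column position of each row's current maximum
        best_val = values[rows, best_col]      # the maximum itself
        # largest maximum claimed in each column (unclaimed columns stay at -inf)
        col_top = np.full(values.shape[1], -np.inf)
        np.maximum.at(col_top, best_col, best_val)
        # a row loses when another row claims the same column with a strictly larger value;
        # rows that have run out of candidates (value -inf) are left alone so the loop ends
        losers = (best_val < col_top[best_col]) & np.isfinite(best_val)
        if not losers.any():
            break
        values[rows[losers], best_col[losers]] = -np.inf  # rule that column out for the losing rows
    return pd.DataFrame(
        {'id2': df.columns.to_numpy()[best_col], 'max_pos': best_val},
        index=df.index,
    )

df = pd.DataFrame({'a': [1, 2, 5, 0], 'b': [4, 5, 1, 0], 'c': [3, 3, 4, 2], 'd': [1, 0, 0, 1]})
result = resolve_row_maxima(df)
# rows 0..3 end up with columns c, b, a, d and values 3, 5, 5, 1
```

Each pass costs a few whole-array operations, and every pass that finds a conflict removes at least one candidate cell, so the loop always terminates. Ties (the second matrix in the example) are left untouched, as described above.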

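With a 100x100 matrix, for example, this code works perfectly well for me. However, once the matrix reaches 50,000x50,000, the code takes a very long time to finish. My code is probably the least efficient way to do this, but I do not know how else to approach it.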

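I have been reading that threads in Python might help, but starting 50,000 threads made no difference because my computer did not use any more CPU. I also tried functions such as .max(), but I could not get the column of the max and compare it with the other maxima...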
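
On the point about `.max()` not giving the column: pandas also has `DataFrame.idxmax`, which returns the column label of each row's maximum. A minimal sketch using the example matrix from above (it assumes the id column, if any, has already been moved to the index, for instance with `set_index`):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 5, 0], 'b': [4, 5, 1, 0], 'c': [3, 3, 4, 2], 'd': [1, 0, 0, 1]})

row_max = df.max(axis=1)         # largest value in each row
row_max_col = df.idxmax(axis=1)  # label of the column holding it (first one on ties)

# same layout as the first iteration shown in the example
first_pass = pd.DataFrame({'id2': row_max_col, 'max_pos': row_max})
```

Here `first_pass` holds b/4, b/5, a/5, c/2 for rows 0 to 3, which matches the first iteration in the example.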

If anyone can give me some advice on how to make this more efficient, I would be very grateful.

1 Answer:

Answer 0: (score: 1)

More information is needed here. What are you trying to accomplish?

This should get you part of the way, but to fully implement what you are doing I need more context.

We will import numpy, random, and Counter from collections:

import numpy as np
import random 
from collections import Counter

We will create a random 50k x 50k matrix with numbers between -10M and +10M:

mat = np.random.randint(-10000000,10000000,(50000,50000))

Now, to get the maximum of each row, we can use the following list comprehension:

maximums = [max(mat[x,:]) for x in range(len(mat))]
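
As a possible speed-up for this step, NumPy can compute all row maxima in one call with `max(axis=1)`. A small sketch on a made-up 5 x 4 stand-in for the big matrix (the seed and the sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed only so the sketch is reproducible
small = rng.integers(-10, 10, size=(5, 4))   # small stand-in for the 50k x 50k matrix

maximums_loop = [max(small[x, :]) for x in range(len(small))]  # the list comprehension above
maximums_vec = small.max(axis=1)                               # one vectorized call

same = maximums_loop == maximums_vec.tolist()  # both approaches give the same values
```

`small.argmax(axis=1)` returns the matching column positions if those are needed as well.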

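Now we want to find out which of these are not the maximum of any other row. We can use Counter on the list of maximums to count how often each one occurs. Counter returns a counter object, which behaves like a dictionary with the maximums as keys and the number of times each appears as values. We then use a dictionary comprehension that keeps only the entries whose count == 1, which gives us the maximums that appear exactly once. We call .keys() to get the numbers themselves and convert them to a list.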

c = Counter(maximums)
{9999117: 15,
9998584: 2,
9998352: 2,
9999226: 22,
9999697: 59,
9999534: 32,
9998775: 8,
9999288: 18,
9998956: 9,
9998119: 1,
...}

k = list( {x: c[x] for x in c if c[x] == 1}.keys() )

[9998253,
 9998139,
 9998091,
 9997788,
 9998166,
 9998552,
 9997711,
 9998230,
 9998000,
...]
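
The same "appears exactly once" filter can also be written with `np.unique` and `return_counts=True`, which keeps the work inside NumPy. A sketch on made-up maxima (the array contents are invented for illustration):

```python
import numpy as np
from collections import Counter

maximums = np.array([7, 9, 9, 4, 7, 5])  # made-up row maxima

# Counter version, as in the answer
c = Counter(maximums.tolist())
k = [x for x in c if c[x] == 1]

# NumPy version: sorted unique values and how often each occurs
values, counts = np.unique(maximums, return_counts=True)
k_vec = values[counts == 1]
```

Both give the values 4 and 5 here. Note that `np.unique` returns them sorted, while the Counter version keeps first-seen order.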

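Finally, we can use the following list comprehension to iterate over the original list of maximums and get the indices of the rows where these values occur.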

indices = [i for i, x in enumerate(maximums) if x in k]
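
Since `k` is a list, every `x in k` test scans it from the start. Two possible alternatives, sketched on made-up data: convert `k` to a set first, or let NumPy do the membership test with `np.isin`.

```python
import numpy as np

maximums = np.array([7, 9, 9, 4, 7, 5])  # made-up row maxima
k = [4, 5]                               # maxima that occur exactly once

# list membership, as in the answer
indices_list = [i for i, x in enumerate(maximums) if x in k]

# set membership: constant-time lookups on average
k_set = set(k)
indices_set = [i for i, x in enumerate(maximums.tolist()) if x in k_set]

# fully vectorized: boolean mask, then the positions where it is True
indices_vec = np.flatnonzero(np.isin(maximums, k))
```

All three produce the row positions 3 and 5 for this input.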

Depending on what you are looking for, we can go from here.

This is not the fastest program: on an already-loaded 50,000 x 50,000 matrix, finding the maximums, the Counter, and the indices takes 182 seconds.