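I have the following problem: I opened a matrix with the pandas module, where every cell holds a number between -1 and 1. What I want to find is the largest "possible" value in each row that is not also the maximum of another row.
For example, if two rows have their maximum in the same column, I compare the two values and keep the larger one. Then, for the row whose maximum is smaller than the other row's, I take its second-largest value (and repeat the same analysis again and again).
To explain myself, it is best to look at my code: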
import numpy as np
import pandas as pd

matrix = pd.read_csv("matrix.csv")
# this matrix has an id (or name) for each column
# ... and the first column has the id of each row
results = pd.DataFrame(np.empty((len(matrix), 3), dtype=object), columns=['id1', 'id2', 'max_pos'])
l = len(matrix.columns)  # number of columns
again = True
while again:
    again = False
    for i in range(len(matrix)):
        max_column = str(1)  # start at 1 because the first column is an id
        for j in range(2, l):
            if matrix[max_column][i] < matrix[str(j)][i]:
                max_column = str(j)
        results.loc[i, 'id1'] = str(i)  # I could also put matrix['0'][i] here
        results.loc[i, 'id2'] = max_column
        results.loc[i, 'max_pos'] = matrix[max_column][i]
    # now check whether two or more rows have the same max column
    for i in range(len(results)):
        for ii in range(len(results)):
            # if two id1 have their max in the same column, keep the one with the bigger
            # ... max value and change the other to -1 to iterate again
            if (results['id2'][i] == results['id2'][ii]) and (results['max_pos'][i] < results['max_pos'][ii]):
                matrix.loc[i, results['id2'][i]] = -1
                again = True
An example:
#consider
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[4, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})
   a  b  c  d
0  1  4  3  1
1  2  5  3  0
2  5  1  4  0
3  0  0  2  1
#at the first iteration I will have the following result
0 b 4 # this means that the row 0 has its maximum at column 'b' and its value is 4
1 b 5
2 a 5
3 c 2
#the problem is that column b is the maximum of row 0 and 1, but I know that the maximum of row 1 is bigger than row 0, so I take the second maximum of row 0, then:
0 c 3
1 b 5
2 a 5
3 c 2
#now I solved the problem for row 0 and 1, but I have that the column c is the maximum of row 0 and 3, so I compare them and take the second maximum in row 3
0 c 3
1 b 5
2 a 5
3 d 1
#now I'm done. In the case that two rows have the same column as maximum and also the same number, nothing happens and I keep those values.
#what if the matrix were
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[5, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})
   a  b  c  d
0  1  5  3  1
1  2  5  3  0
2  5  1  4  0
3  0  0  2  1
#then, at the first iteration the result will be:
0 b 5
1 b 5
2 a 5
3 c 2
#then, given that the max value of row 0 and 1 is at the same column, I should compare the maximum values
# ... but in this case the values are the same (both are 5), this would be the end of iterating
# ... because I can't choose between row 0 and 1 and the other rows have their maximum at different columns...
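For example, if I have a 100x100 matrix, this code works perfectly for me. But if the matrix grows to 50,000x50,000, the code takes a very long time to finish. My code is probably the most inefficient way to do this, but I don't know how else to approach it.
I have been reading that threads in Python might help, but launching 50,000 threads does not help, because my computer doesn't use any more CPU. I also tried using functions such as .max(), but I couldn't get the column of the max and compare it with the other maxima...
If anyone could give me some advice to make this more efficient, I would be very grateful.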
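Answer 0: (score: 1)
More information is needed here. What are you trying to accomplish?
This will get you part of the way, but to fully implement what you are doing, I need more context.
We'll import numpy, random, and Counter from collections: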
import numpy as np
import random
from collections import Counter
We'll create a random 50k x 50k matrix with numbers between -10M and +10M:
mat = np.random.randint(-10000000,10000000,(50000,50000))
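As a rough aside on memory, assuming NumPy's default integer is 64-bit on your platform: a 50,000 x 50,000 array needs about 20 GB, and asking for `dtype=np.int32` (the values here fit) would halve that. The sketch below only does the arithmetic and builds a much smaller array; `n` and `small` are illustrative names, not part of the code above.

```python
import numpy as np

n = 50_000

# bytes needed for an n x n array at each integer width
bytes_int64 = n * n * np.dtype(np.int64).itemsize
bytes_int32 = n * n * np.dtype(np.int32).itemsize
print(bytes_int64 / 1e9, "GB as int64")  # 20.0
print(bytes_int32 / 1e9, "GB as int32")  # 10.0

# same value range on a small array, stored as 32-bit integers
small = np.random.randint(-10_000_000, 10_000_000, (1_000, 1_000), dtype=np.int32)
print(small.dtype, small.nbytes)  # int32 4000000
```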
Now, to get the maximum of each row, we can use the following list comprehension:
maximums = [max(mat[x,:]) for x in range(len(mat))]
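A possible vectorized alternative, sketched on a small illustrative array (`small`, `row_max`, and `row_argmax` are names made up for this example): `ndarray.max(axis=1)` returns every row maximum in one call, and `argmax(axis=1)` gives the column position of each, which is the piece the question said was missing.

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.integers(-10, 10, size=(5, 4))  # stand-in for the big matrix

row_max = small.max(axis=1)        # maximum value of every row
row_argmax = small.argmax(axis=1)  # column index where that maximum sits

# the list comprehension yields the same values, one Python call per row
maximums = [max(small[x, :]) for x in range(len(small))]
print(row_max.tolist() == maximums)
```

Because the loop over rows runs inside NumPy rather than in Python, this usually scales much better than calling the built-in `max` once per row.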
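Now we want to find out which of those are not a maximum in any other row. We can use Counter on the list of maximums to find out how many of each there are. Counter returns a counter object, which works like a dictionary with the maximums as keys and the number of times each appears as values.
Then we do a dictionary comprehension that keeps only the entries whose value equals 1. That gives us the maximums that show up only once. We use the .keys() function to grab the numbers themselves and then turn them into a list.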
c = Counter(maximums)
{9999117: 15,
9998584: 2,
9998352: 2,
9999226: 22,
9999697: 59,
9999534: 32,
9998775: 8,
9999288: 18,
9998956: 9,
9998119: 1,
...}
k = list( {x: c[x] for x in c if c[x] == 1}.keys() )
[9998253,
9998139,
9998091,
9997788,
9998166,
9998552,
9997711,
9998230,
9998000,
...]
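If the maximums are already in a NumPy array, `np.unique` with `return_counts=True` can do the counting step as well. This is a sketch on a tiny made-up list of maximums, not the 50k data; `values`, `counts`, and `unique_maxima` are illustrative names.

```python
import numpy as np

maximums = np.array([7, 3, 7, 9, 3, 5])  # made-up row maximums

# sorted distinct values and how often each one occurs
values, counts = np.unique(maximums, return_counts=True)

# keep only the maximums that occur exactly once
unique_maxima = values[counts == 1]
print(unique_maxima.tolist())  # [5, 9]
```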
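Finally, we can use the following list comprehension to iterate through the original list of maximums and get the indices of the rows where they occur.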
indices = [i for i, x in enumerate(maximums) if x in k]
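One thing that may matter at this size: `x in k` scans the list `k` once per row, so the comprehension does a linear search for each of the 50,000 maximums. A sketch of two alternatives on made-up data, a set for constant-time lookups and `np.isin` for a vectorized mask (`k_set` and `indices_np` are names introduced here):

```python
from collections import Counter

import numpy as np

maximums = [7, 3, 7, 9, 3, 5]  # made-up row maximums
c = Counter(maximums)

# a set gives average O(1) membership tests instead of scanning a list
k_set = {x for x, n in c.items() if n == 1}
indices = [i for i, x in enumerate(maximums) if x in k_set]
print(indices)  # [3, 5]

# vectorized variant; np.isin needs a sequence, so the set is converted to a list
indices_np = np.flatnonzero(np.isin(np.array(maximums), list(k_set)))
print(indices_np.tolist())  # [3, 5]
```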
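Depending on what you're looking for, we can go from here.
This isn't the fastest program, but finding the maximums, the counter, and the indices took 182 seconds on an already-loaded 50,000 x 50,000 matrix.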
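If the goal is the iterative procedure from the question's worked example, one way it might be vectorized is sketched below. This is my reading of that procedure, not a confirmed solution: `resolve_row_maxima` is a hypothetical helper, it marks discarded cells with `-inf` on a float copy instead of writing -1 into the data, and it assumes every input value is finite.

```python
import numpy as np


def resolve_row_maxima(values):
    """Return (column index, value) per row after resolving shared maxima.

    A row that loses a column to a strictly larger maximum has that cell
    discarded and falls back to its next-largest cell. Ties are left alone.
    """
    work = np.array(values, dtype=float)  # float copy so -inf can mark discarded cells
    rows = np.arange(work.shape[0])
    while True:
        col = work.argmax(axis=1)   # current best column of every row
        val = work[rows, col]       # and the value found there
        best = np.full(work.shape[1], -np.inf)
        np.maximum.at(best, col, val)  # largest claim made on each column
        # a row loses if another row claims the same column with a larger value;
        # rows with no finite cell left are skipped so the loop terminates
        losers = (val < best[col]) & np.isfinite(val)
        if not losers.any():
            col = np.where(np.isfinite(val), col, -1)  # -1 means no column left
            return col, val
        work[rows[losers], col[losers]] = -np.inf


# the two matrices from the question, columns a, b, c, d as indices 0..3
first = [[1, 4, 3, 1], [2, 5, 3, 0], [5, 1, 4, 0], [0, 0, 2, 1]]
col, val = resolve_row_maxima(first)
print(col.tolist(), val.tolist())  # [2, 1, 0, 3] [3.0, 5.0, 5.0, 1.0]

second = [[1, 5, 3, 1], [2, 5, 3, 0], [5, 1, 4, 0], [0, 0, 2, 1]]
col2, val2 = resolve_row_maxima(second)
print(col2.tolist(), val2.tolist())  # [1, 1, 0, 2] [5.0, 5.0, 5.0, 2.0]
```

With a DataFrame `df` whose id column has been moved to the index, `df.to_numpy()` would supply the values and `df.columns[col]` would map the indices back to column labels for rows where `col` is not -1. Each pass is vectorized, but the number of passes depends on the data, and the float copy of a 50,000 x 50,000 matrix needs about 20 GB on its own.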