I have an M x N boolean matrix, where M = 6000 and N = 1000:
1 | 0 1 0 0 0 1 ----> 1000 columns
2 | 1 0 1 0 1 0
3 | 0 0 1 1 0 0
|
V
6000 rows
Now for each column, I want to find the first occurrence where the value is 1. For the above example, the result for the first 6 columns would be 2 1 2 3 2 1.
The code I currently have is:

sig_matrix = list()
num_columns = df.columns
for col_name in num_columns:
    print('Processing column {}'.format(col_name))
    # Take the first row where this column equals 1 and read its 'perm' value.
    sig_index = df.filter(df[col_name] == 1).\
        select('perm').limit(1).collect()[0]['perm']
    sig_matrix.append(sig_index)
The above code is really slow: it takes 5~7 minutes to process 1000 columns. Is there any faster way to do this? I am also willing to use a pandas DataFrame instead of a PySpark DataFrame if that is faster.
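For reference, the conversion itself should be cheap at this size; a minimal sketch, assuming the data fits in driver memory:

# Collect the Spark DataFrame to the driver as a pandas DataFrame;
# a 6000 x 1000 boolean matrix easily fits.
pdf = df.toPandas()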
Answer 0 (score: 1)
Here is a numpy version that runs in < 1 s for me, so it should be much more suitable for data of this size:
import numpy as np

# Random 0/1 matrix with the same shape as the real data.
arr = np.random.choice([0, 1], size=(6000, 1000))

# For each column, the row index of the first 1.
[np.argwhere(arr[:, i] == 1)[0][0] for i in range(1000)]
There may be an even more efficient numpy solution; for instance, a fully vectorized one, sketched below.
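A minimal sketch of that idea (my addition, reusing the same arr as above): argmax along axis 0 finds the first 1 in every column in one pass, with no Python loop.

import numpy as np

arr = np.random.choice([0, 1], size=(6000, 1000))

# argmax along axis 0 gives, per column, the row index of the first maximum;
# on a 0/1 matrix that is the first occurrence of 1.
first_ones = arr.argmax(axis=0)

# Caveat: an all-zero column also reports index 0, so flag those separately.
has_one = arr.any(axis=0)

Since the whole computation stays inside numpy, this should be noticeably faster than calling np.argwhere once per column.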
Answer 1 (score: 0)
I ended up solving my problem using numpy. This is how I did it:
import numpy as np

sig_matrix = list()
columns = list(df)
for col_name in columns:
    # argmax returns the position of the first maximum in the column.
    sig_index = np.argmax(df[col_name]) + 1
    sig_matrix.append(sig_index)
Since the values in my columns are 0 and 1, argmax returns the first occurrence of the value 1.
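One caveat worth adding: np.argmax also returns 0 when a column contains no 1 at all, which the + 1 above would turn into a bogus index of 1. A guarded sketch, using a hypothetical two-column frame in place of the real df:

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real 6000 x 1000 frame.
df = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 0, 0]})

sig_matrix = list()
for col_name in list(df):
    col = df[col_name].to_numpy()
    # Only trust argmax when the column actually contains a 1.
    sig_index = int(np.argmax(col)) + 1 if col.any() else None
    sig_matrix.append(sig_index)

This leaves None for columns with no 1 instead of silently reporting 1.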