Search boolean matrix using pyspark

Asked: 2017-11-13 06:46:32

Tags: python apache-spark pyspark spark-dataframe pyspark-sql

I have an M x N boolean matrix, where M = 6000 and N = 1000:

    1 | 0 1 0 0 0 1 ----> 1000
    2 | 1 0 1 0 1 0 ----> 1000
    3 | 0 0 1 1 0 0 ----> 1000
      V
    6000

Now, for each column, I want to find the first row where the value is 1. For the six columns shown above, the expected result is 2 1 2 3 2 1.

The code I currently have is:

    sig_matrix = list()
    column_names = df.columns
    for col_name in column_names:
        print('Processing column {}'.format(col_name))
        # first row (via the 'perm' index column) where this column is 1
        sig_index = df.filter(df[col_name] == 1).\
                    select('perm').limit(1).collect()[0]['perm']
        sig_matrix.append(sig_index)

The above code is really slow: it takes 5-7 minutes to process 1000 columns. Is there a faster way to do this? I am also willing to use a pandas DataFrame instead of a PySpark DataFrame if that is faster.

2 Answers:

Answer 0 (score: 1):

Here is a numpy version that runs in under 1 s for me, so it should be a better fit for data of this size:

    import numpy as np

    # 6000 x 1000 matrix of random 0s and 1s as test data
    arr = np.random.choice([0, 1], size=(6000, 1000))
    # 0-based row index of the first 1 in each column
    [np.argwhere(arr[:, i] == 1)[0][0] for i in range(1000)]

There may be a more efficient numpy solution.
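
One such variant (a sketch, reusing the `arr` above): `argmax` along axis 0 finds the first 1 in every column at once, with no Python-level loop.

    # index of the first 1 in each column, 0-based, computed in one shot
    first_ones = arr.argmax(axis=0)

    # caveat: argmax also returns 0 for an all-zero column,
    # so check arr.any(axis=0) if such columns can occur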

Answer 1 (score: 0):

I ended up solving my problem using numpy. Here is how I did it:

    import numpy as np

    sig_matrix = list()
    columns = list(df)  # the DataFrame's column names
    for col_name in columns:
        # argmax gives the 0-based position of the first 1; +1 makes it 1-based
        sig_index = np.argmax(df[col_name]) + 1
        sig_matrix.append(sig_index)

Since my columns contain only 0s and 1s, argmax returns the index of the first occurrence of a 1.
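
A minimal illustration of that behavior, with made-up values:

    import numpy as np

    col = np.array([0, 1, 0, 1])
    np.argmax(col)      # -> 1: 0-based position of the first 1
    np.argmax(col) + 1  # -> 2: the 1-based row number used above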