PySpark - count the non-zero columns in each row of a Spark dataframe

Asked: 2019-05-02 01:13:27

Tags: pyspark

I have a dataframe and I need to count the number of non-zero columns in each row in PySpark.

ID COL1 COL2 COL3
1  0    1    -1 
2  0    0     0 
3 -17   20    15
4  23   1     0

Expected output:

ID COL1 COL2 COL3 Count
1    0    1   -1     2
2    0    0    0     0
3  -17   20   15     3
4   23    1    0     2

1 Answer:

Answer 0 (score: 4)

There are several ways to achieve this; one simple approach is shown below -

df = sqlContext.createDataFrame([
    [1,  0,    1,    -1], 
    [2,  0,    0,     0],
    [3, -17,   20,    15],
    [4,  23,   1,     0]], 
    ["ID", "COL1", "COL2", "COL3"]
)

# The value columns, excluding the ID column
df.columns[1:]
['COL1', 'COL2', 'COL3']

# Import Spark SQL functions
from pyspark.sql import functions as F

# Add a "count" column: for each row, add 1 for every value column that is non-zero.
# Note: this is Python's builtin sum, which folds the Column expressions with "+".
df.withColumn(
    "count",
    sum([F.when(F.col(cl) != 0, 1).otherwise(0) for cl in df.columns[1:]])
).show()


+---+----+----+----+-----+
| ID|COL1|COL2|COL3|count|
+---+----+----+----+-----+
|  1|   0|   1|  -1|    2|
|  2|   0|   0|   0|    0|
|  3| -17|  20|  15|    3|
|  4|  23|   1|   0|    2|
+---+----+----+----+-----+