I have a situation where, for each row, I need the count of non-zero values across all of its columns.
DataFrame:
subaccid|srp0|srp1|srp2|srp3|srp4|srp5|srp6|srp7|srp8|srp9|srp10|srp11|srp12
+-------+----+----+----+----+----+----+------+----+----+----+-----+-----+--+
AAA |0.0 |12.0|12.0|0.0 |0.0 |0.0 |10.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0
AAB |12.0|12.0|12.0|10.0|12.0|12.0|12.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0
AAC |10.0|12.0|0.0 |0.0 |0.0 |10.0|10.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0
ZZZ |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |-110.0|0.0 |0.0 |0.0 |0.0 |0.0 |0.0
+-------+----+----+----+----+----+----+------+----+----+----+-----+-----+--+
Expected output:
subaccid,count of nonzeros
AAA,3
AAB,7
AAC,4
ZZZ,1
Answer 0 (score: 1)
This also works, without going through an RDD, shown here with my own data:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = sc.parallelize(Seq(
("r1", 0.0, 0.0, 0.0, 0.0),
("r2", 6.4, 4.9, 6.3, 7.1),
("r3", 4.2, 0.0, 7.2, 8.4),
("r4", 1.0, 2.0, 0.0, 0.0)
)).toDF("ID", "a", "b", "c", "d")
// For every column except ID: 1 if the value is non-zero, 0 otherwise, then sum across columns
val count_non_zero = df.columns.tail.map(x => when(col(x) =!= 0.0, 1).otherwise(0)).reduce(_ + _)
df.withColumn("non_zero_count", count_non_zero).show(false)
Returns:
+---+---+---+---+---+--------------+
|ID |a |b |c |d |non_zero_count|
+---+---+---+---+---+--------------+
|r1 |0.0|0.0|0.0|0.0|0 |
|r2 |6.4|4.9|6.3|7.1|4 |
|r3 |4.2|0.0|7.2|8.4|3 |
|r4 |1.0|2.0|0.0|0.0|2 |
+---+---+---+---+---+--------------+
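This assumes the value columns are double/real; otherwise you run into the Any issues that require asInstanceOf.
You can drop the original columns or use select to finish the job.
Hope this helps.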
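As a rough sketch of the two remarks above (the double-type assumption and finishing with select), the same column expression can cast each value column to double first and then keep only the id and the count. The sample rows, the SparkSession setup, and the names valueCols, nonZeroCount and result are my own illustration, not part of the question; the column names are borrowed from it. Nulls fall through to otherwise(0), so they are not counted.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("non-zero-count").getOrCreate()
import spark.implicits._

// Hypothetical sample: one Int column, one Double column, one nullable Double column.
val df = Seq(
  ("AAA", 0, 12.0, Option(10.0)),
  ("AAB", 12, 0.0, Option.empty[Double]),
  ("ZZZ", 0, 0.0, Option(-110.0))
).toDF("subaccid", "srp0", "srp1", "srp2")

// Every column except the id is treated as a value column.
val valueCols = df.columns.filter(_ != "subaccid")

// Cast to double so Int and Double columns are compared the same way;
// a null comparison is not true, so nulls land in otherwise(0).
val nonZeroCount = valueCols
  .map(c => when(col(c).cast("double") =!= 0.0, 1).otherwise(0))
  .reduce(_ + _)

// Keep only the id and the count instead of appending to all columns.
val result = df.select(col("subaccid"), nonZeroCount.as("non_zero_count"))
result.show(false)
```

Filtering by column name rather than taking columns.tail is a small design choice: it does not depend on the id being the first column.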
Answer 1 (score: 0)
One option is:
//Create dataframe
val df = sc.parallelize(
Seq(("AAA", 0.0, 12.0,12.0,0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
("AAB", 12.0, 12.0, 12.0, 10.0, 12.0, 12.0, 12.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
("AAC", 10.0, 12.0, 0.0, 0.0, 0.0, 10.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
("ZZZ", 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -110.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
)).toDF("subaccid","srp0","srp1","srp2","srp3","srp4","srp5","srp6","srp7","srp8","srp9","srp10","srp11","srp12")
val df2 = df.rdd.map(x => (x.getString(0), x.toSeq.tail.filter(_ != 0).length)).toDF("subaccid", "count")
df2.show
//output
+--------+-----+
|subaccid|count|
+--------+-----+
| AAA| 3|
| AAB| 7|
| AAC| 4|
| ZZZ| 1|
+--------+-----+
Of course, this involves converting to an RDD and back.