I'm new to Spark's RDD API. Thanks to "Spark migrate sql window function to RDD for better performance", I managed to generate the following table:
+-----------------+---+
| _1| _2|
+-----------------+---+
| [col3TooMany,C]| 0|
| [col1,A]| 0|
| [col2,B]| 0|
| [col3TooMany,C]| 1|
| [col1,A]| 1|
| [col2,B]| 1|
|[col3TooMany,jkl]| 0|
| [col1,d]| 0|
| [col2,a]| 0|
| [col3TooMany,C]| 0|
| [col1,d]| 0|
| [col2,g]| 0|
| [col3TooMany,t]| 1|
| [col1,A]| 1|
| [col2,d]| 1|
| [col3TooMany,C]| 1|
| [col1,d]| 1|
| [col2,c]| 1|
| [col3TooMany,C]| 1|
| [col1,c]| 1|
+-----------------+---+
Initial input:
val df = Seq(
  (0, "A", "B", "C", "D"),
  (1, "A", "B", "C", "D"),
  (0, "d", "a", "jkl", "d"),
  (0, "d", "g", "C", "D"),
  (1, "A", "d", "t", "k"),
  (1, "d", "c", "C", "D"),
  (1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
val columnsToDrop = Seq("col3TooMany")
val columnsToCode = Seq("col1", "col2")
val target = "TARGET"
import org.apache.spark.sql.functions._

// Build one (column-name, value) struct per column and explode them,
// turning each input row into one long-format row per column.
val exploded = explode(array(
  (columnsToDrop ++ columnsToCode).map(c =>
    struct(lit(c).alias("k"), col(c).alias("v"))): _*
)).alias("level")

val long = df.select(exploded, $"TARGET")
import org.apache.spark.util.StatCounter
Then:
long.as[((String, String), Int)].rdd
  .aggregateByKey(StatCounter())(_ merge _, _ merge _) // fold TARGET values into per-key stats, then merge partial stats
  .collect.head
res71: ((String, String), org.apache.spark.util.StatCounter) = ((col2,B),(count: 3, mean: 0,666667, stdev: 0,471405, max: 1,000000, min: 0,000000))
which aggregates statistics over every unique value of each column.
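To inspect every (column, value) group rather than just the first collected element, the same aggregation can be printed in full (the output formatting below is just illustrative):

long.as[((String, String), Int)].rdd
  .aggregateByKey(StatCounter())(_ merge _, _ merge _)
  .collect
  .foreach { case ((col, v), s) =>
    println(f"$col%-11s $v%-3s count=${s.count} mean=${s.mean}%.6f")
  }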
How can I add, alongside the count (3 for B in col2), a second count (possibly as a tuple) representing the number of B in col2 where TARGET == 1? In this case it should be 2.
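For reference, the tuple-based aggregation the question envisions could look like the sketch below, carrying a (stats, ones) pair per key; since TARGET is 0/1, adding the value itself counts the ones. The answer that follows shows this extra bookkeeping is unnecessary.

long.as[((String, String), Int)].rdd
  .aggregateByKey((StatCounter(), 0L))(
    { case ((s, ones), t) => (s.merge(t), ones + t) },       // fold one TARGET value into the pair
    { case ((s1, o1), (s2, o2)) => (s1.merge(s2), o1 + o2) } // merge per-partition pairs
  )
  .collect.head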
Answer 0 (score: 2):
You don't need an additional aggregation here. With a binary target column, mean is just the empirical probability of target being equal to 1:

number of ones: count * mean
number of zeros: count * (1 - mean)
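Applied to the result above, that arithmetic is a one-liner; the colName/value/stats bindings are just one way to hold the collected head:

val ((colName, value), stats) = long.as[((String, String), Int)].rdd
  .aggregateByKey(StatCounter())(_ merge _, _ merge _)
  .collect.head

// for ((col2,B)): count = 3, mean = 0.666667
val ones  = math.round(stats.count * stats.mean) // rows with TARGET == 1, here 2
val zeros = stats.count - ones                   // rows with TARGET == 0, here 1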