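I am still fairly new to Scala, coming from an R background, and I am finding it hard to write the kind of loop that finds outliers and replaces them with the 99th or 1st percentile value. Here is my question.
Suppose I have the following DataFrame: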
val input_val = sc.parallelize(Seq(
("a", 5, 7, 9, 12, 13),
("b", 6, 4, 3, 20, 17),
("c", 4, 9, 4, 6 , 9),
("d", 1, 2, 6, 8 , 1),
("e",1 ,2, 3, 4, 6)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")
val Array(a1,a2) = input_val.stat.approxQuantile("var1",Array(0.01,0.99),0.0)
val P1 = a1
val P2 = a2
println(a1)
println(a2)
val Array(a11,a22) = input_val.stat.approxQuantile("var2",Array(0.01,0.99),0.0)
val P1_1 = a11
val P2_1 = a22
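import org.apache.spark.sql.functions.when
// Cap var1 and var2 at their 1st and 99th percentile values.
// Column names are plain strings; show() is called separately because it returns Unit.
val Outlier_remover = input_val
  .withColumn("OutlierExcvar1", when(input_val("var1") < P1, P1)
    .when(input_val("var1") > P2, P2)
    .otherwise(input_val("var1")))
  .withColumn("OutlierExcvar2", when(input_val("var2") < P1_1, P1_1)
    .when(input_val("var2") > P2_1, P2_1)
    .otherwise(input_val("var2")))
Outlier_remover.show()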
However, I need to do this for 45 columns. How can it be implemented dynamically, without writing each column out by hand?
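To make the goal concrete, here is a rough sketch of the kind of thing I am after, not a tested solution. It assumes Spark 2.2 or later, where `approxQuantile` also accepts an array of column names, and it folds over the column list so that each column gets its own capped copy. The helper name `capOutliers`, its parameters, and the local `SparkSession` setup are made up for illustration. It also assumes no column is entirely null, since the pattern match on the two bounds would fail in that case.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// Local session for a standalone run (assumption; spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("cap-outliers").getOrCreate()
import spark.implicits._

val input_val = Seq(
  ("a", 5, 7, 9, 12, 13),
  ("b", 6, 4, 3, 20, 17),
  ("c", 4, 9, 4, 6, 9),
  ("d", 1, 2, 6, 8, 1),
  ("e", 1, 2, 3, 4, 6)
).toDF("ID", "var1", "var2", "var3", "var4", "var5")

// Hypothetical helper: adds one "<prefix><column>" column per input column,
// with values clamped to that column's [lower, upper] quantile range.
def capOutliers(df: DataFrame,
                cols: Seq[String],
                lower: Double = 0.01,
                upper: Double = 0.99,
                prefix: String = "OutlierExc"): DataFrame = {
  // One call computes both quantiles for every column;
  // relativeError = 0.0 asks for exact quantiles, which can be costly on large data.
  val bounds = df.stat.approxQuantile(cols.toArray, Array(lower, upper), 0.0)

  // Thread the DataFrame through all columns, adding one capped column at a time.
  cols.zip(bounds).foldLeft(df) { case (acc, (c, Array(lo, hi))) =>
    acc.withColumn(
      prefix + c,
      when(col(c) < lo, lit(lo))
        .when(col(c) > hi, lit(hi))
        .otherwise(col(c))
    )
  }
}

// Every column except the key; with 45 variables the list is built the same way.
val varCols = input_val.columns.filter(_ != "ID").toSeq
val capped = capOutliers(input_val, varCols)
capped.show()
```

The new columns come out as doubles because the quantile bounds are doubles. Passing the original column name instead of `prefix + c` to `withColumn` would overwrite the column in place rather than add a copy.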