UDF to sum array columns in Spark Scala

Time: 2018-04-11 18:25:50

Tags: arrays scala apache-spark apache-spark-sql spark-dataframe

We need to compute per-key sums and ratios, as follows.

Input file: Array(Array("a:25","a:30","b:30"),Array("a:25","a:30","b:30"))

We need the output as ratios:

step 1:
=======
a:25+30+25+30  ==> a:110
b:30 + 30      ==> b:60

step 2:
=======
a = a/(a+b)  ==> a:110/170
b = b/(a+b)  ==> b:60/170

So far, I have tried this:

val a = Array(Array("a:25","a:30","b:30"),Array("a:25","a:30","b:30"))
val res=a.flatMap(a=>a.map(x=>x.split(":")))
val res1=res.map(y => (y(0).asInstanceOf[String],(y(1).toDouble.asInstanceOf[Double]))).groupBy(_._1).map(x=>(x._1, x._2.map(_._2).sum)).toArray

Input file or DataFrame:

 [54,WrappedArray(
    [WrappedArray(BCD001:10.0, BCD006:20.0),
    WrappedArray(BCD003:10.0, BCD006:30.0)],
    [WrappedArray(BCD005:50.0, BCD006:10.0),
    WrappedArray(BCD003:70.0, BCD006:0.0)])]

Output file or DataFrame: after
adding all the BCD code values and computing the ratio per BCD code

e.g. in record 1, sum = 10+20+10+30+50+10+70+0 = 200
ratio per BCD code, e.g. BCD001 = 10/200 = 0.05

Output file:

[54,WrappedArray([BCD001:0.05,
BCD006:0.3,BCD003:0.4,BCD005:0.25])]

2 answers:

Answer 0 (score: 1)

Your data structure does not look like a sequence of sequences

val a = Array(Array("a:25","a:30","b:30"),Array("a:25","a:30","b:30"))

but more like a sequence of tuples of sequences (you can verify this in Spark with printSchema)

val a = Seq((Seq("BCD001:10.0", "BCD006:20.0"),Seq("BCD003:10.0", "BCD006:30.0")),
  (Seq("BCD005:50.0", "BCD006:10.0"),Seq("BCD003:70.0", "BCD006:0.0")))

In that case you need something like:

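// Parse "code:value" strings into (code, value) pairs.
def parse(sq: Seq[String]) = sq.map { x =>
  val y = x.split(":")
  (y.head, y.last.toDouble)
}
// Sum the values per code across both halves of every tuple.
val res = a.flatMap(a => Seq(parse(a._1), parse(a._2))).flatten
  .groupBy { case (k, _) => k }
  .map { case (k, vs) => (k, vs.foldLeft(0.0) { case (t, (_, v)) => t + v }) }
// Divide each per-code sum by the grand total.
val tot = res.values.sum
res.map { case (k, v) => s"$k:${v / tot}" }.toArray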

which results in:

res0: Array[String] = Array(BCD006:0.3, BCD003:0.4, BCD005:0.25, BCD001:0.05)
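
As a rough sketch, the same steps can be collected into one function, which may be easier to reuse (for example as the body of a UDF). The name `ratios`, its signature, and the `sample` value are illustrative assumptions, not part of any Spark API, and the code assumes every entry is a well-formed "code:value" string.

```scala
// Hypothetical helper: `ratios` is an illustrative name, not a Spark API.
// Assumes every entry has the form "code:value" with a numeric value.
def ratios(rows: Seq[(Seq[String], Seq[String])]): Map[String, Double] = {
  // Flatten both halves of every tuple into (code, value) pairs.
  val pairs = rows.flatMap { case (l, r) => l ++ r }.map { s =>
    val parts = s.split(":")
    (parts(0), parts(1).toDouble)
  }
  // Sum the values per code.
  val sums = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
  val total = sums.values.sum
  // Divide each per-code sum by the grand total; a zero total yields an empty map.
  if (total == 0.0) Map.empty else sums.map { case (k, v) => (k, v / total) }
}

// Same sample data as above.
val sample = Seq(
  (Seq("BCD001:10.0", "BCD006:20.0"), Seq("BCD003:10.0", "BCD006:30.0")),
  (Seq("BCD005:50.0", "BCD006:10.0"), Seq("BCD003:70.0", "BCD006:0.0")))
val out = ratios(sample)
```

Returning a `Map` instead of formatted strings keeps the numbers usable for further computation; formatting as "code:ratio" can be done at the end if needed.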

Answer 1 (score: 1)

You can do the following

val a = Array(Array("a:25","a:30","b:30"),Array("a:25","a:30","b:30"))

val res = a.flatMap(a=>a.map(x=>{
  val splitted = x.split(":")
  (splitted(0).trim, splitted(1).trim.toInt)
}))
  .groupBy(_._1)
  .map(x => (x._1, x._2.map(_._2).sum))

val total = res.values.sum

res.map(x => (x._1, x._2+"/"+total))
  .foreach(println)

which should give you

(b,60/170)
(a,110/170)

If you don't want string values, you can do

res.map(x => (x._1, x._2.toDouble/total))
  .foreach(println)

which should give

(b,0.35294117647058826)
(a,0.6470588235294118)
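
The `.toDouble` matters here: the sums and the total are both `Int`, so a plain `/` would be integer division. A minimal illustration with the numbers for key a (the names `aSum` and `grandTotal` are only for this sketch):

```scala
val aSum = 110        // sum for key "a"
val grandTotal = 170  // sum over all keys

// Int / Int truncates toward zero, so the fraction is lost.
val truncated = aSum / grandTotal
// Converting one operand to Double keeps the fraction.
val ratio = aSum.toDouble / grandTotal

println(s"$truncated vs $ratio")
```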