We need to compute per-key sums and ratios, as follows.
Input: Array(Array("a:25","a:30","b:30"),Array("a:25","a:30","b:30"))
The output should be expressed as ratios:
Step 1: sum the values for each key
=======
a:25+30+25+30 ==> a:110
b:30+30 ==> b:60
Step 2: divide each sum by the grand total
=======
a = a/(a+b) ==> a:110/170
b = b/(a+b) ==> b:60/170
So far I have tried this:
val a = Array(Array("a:25","a:30","b:30"),Array("a:25","a:30","b:30"))
val res=a.flatMap(a=>a.map(x=>x.split(":")))
val res1=res.map(y => (y(0).asInstanceOf[String],(y(1).toDouble.asInstanceOf[Double]))).groupBy(_._1).map(x=>(x._1, x._2.map(_._2).sum)).toArray
Input file or DataFrame:
[54,WrappedArray(
[WrappedArray(BCD001:10.0, BCD006:20.0),
WrappedArray(BCD003:10.0, BCD006:30.0)],
[WrappedArray(BCD005:50.0, BCD006:10.0),
WrappedArray(BCD003:70.0, BCD006:0.0)])]
Output file or DataFrame, after summing all the BCD code values and computing the ratio per BCD code.
E.g. for record 1 the sum is 10+20+10+30+50+10+70+0 = 200,
so the ratio for BCD001 is 10/200 = 0.05:
[54,WrappedArray([BCD001:0.05,
BCD006:0.3,BCD003:0.4,BCD005:0.25])]
Answer 0 (score: 1)
Your data structure does not look like a sequence of sequences
val a = Array(Array("a:25","a:30","b:30"),Array("a:25","a:30","b:30"))
but rather like a sequence of tuples of sequences (you can verify this in Spark with printSchema)
val a = Seq((Seq("BCD001:10.0", "BCD006:20.0"),Seq("BCD003:10.0", "BCD006:30.0")),
(Seq("BCD005:50.0", "BCD006:10.0"),Seq("BCD003:70.0", "BCD006:0.0")))
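In that case you need something like:
// Parse entries such as "BCD001:10.0" into (code, value) pairs.
def parse(sq: Seq[String]) = sq.map { x =>
  val y = x.split(":")
  (y.head, y.last.toDouble)
}
// Flatten both tuple elements, group by code, and sum the values per code.
val res = a.flatMap(a => Seq(parse(a._1), parse(a._2))).flatten
  .groupBy { case (k, _) => k }
  .map { case (k, vs) => (k, vs.foldLeft(0.0) { case (t, (_, v)) => t + v }) }
// Divide each per-code sum by the grand total.
val tot = res.values.sum
res.map { case (k, v) => s"$k:${v / tot}" }.toArray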
which results in:
res0: Array[String] = Array(BCD006:0.3, BCD003:0.4, BCD005:0.25, BCD001:0.05)
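The same logic can be packaged as a reusable function that returns numeric ratios instead of formatted strings. This is only a minimal sketch: the names `RatioPerCode`, `parse` and `ratios` are hypothetical, and it assumes every entry has the form `code:value` with a numeric value, and that the grand total is non-zero (otherwise the division yields NaN).

```scala
// Hypothetical helper object; the names are illustrative, not from the answer.
object RatioPerCode {
  // Split an entry such as "BCD001:10.0" into ("BCD001", 10.0).
  def parse(entries: Seq[String]): Seq[(String, Double)] =
    entries.map { e =>
      val parts = e.split(":")
      (parts.head.trim, parts.last.trim.toDouble)
    }

  // Sum the values per code, then divide each sum by the grand total.
  // Assumes the total is non-zero.
  def ratios(rows: Seq[(Seq[String], Seq[String])]): Map[String, Double] = {
    val sums = rows
      .flatMap { case (left, right) => parse(left) ++ parse(right) }
      .groupBy(_._1)
      .map { case (code, pairs) => code -> pairs.map(_._2).sum }
    val total = sums.values.sum
    sums.map { case (code, sum) => code -> sum / total }
  }
}
```

Calling `RatioPerCode.ratios(a)` with the `a` defined in this answer gives a `Map[String, Double]` whose values add up to 1 (up to floating-point rounding), which makes a convenient sanity check.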
Answer 1 (score: 1)
You can do the following
val a = Array(Array("a:25","a:30","b:30"),Array("a:25","a:30","b:30"))
val res = a.flatMap(a=>a.map(x=>{
val splitted = x.split(":")
(splitted(0).trim, splitted(1).trim.toInt)
}))
.groupBy(_._1)
.map(x => (x._1, x._2.map(_._2).sum))
val total = res.values.sum
res.map(x => (x._1, x._2+"/"+total))
.foreach(println)
which should give you
(b,60/170)
(a,110/170)
If you don't want string values, you can do
res.map(x => (x._1, x._2.toDouble/total))
.foreach(println)
which should give
(b,0.35294117647058826)
(a,0.6470588235294118)
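Note that `toInt` throws a `NumberFormatException` on decimal strings such as "10.0", which appear in the DataFrame sample from the question, and that a `Map` does not guarantee iteration order (hence `b` being printed before `a`). A hedged variant that parses with `toDouble` and sorts by key might look like the sketch below; `SimpleRatios` and `ratios` are hypothetical names, and the code assumes well-formed `key:value` entries.

```scala
// Hypothetical helper object; the names are illustrative, not from the answer.
object SimpleRatios {
  // Returns (key, ratio) pairs sorted by key so the order is stable.
  def ratios(groups: Array[Array[String]]): Seq[(String, Double)] = {
    val sums = groups.flatten
      .map { entry =>
        val parts = entry.split(":").map(_.trim)
        (parts(0), parts(1).toDouble) // toDouble accepts both "25" and "10.0"
      }
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).sum }
    val total = sums.values.sum
    sums.toSeq.sortBy(_._1).map { case (key, sum) => (key, sum / total) }
  }
}
```

With the array `a` from this answer, `SimpleRatios.ratios(a)` yields the entry for `a` first and the entry for `b` second, with the ratios 110/170 and 60/170.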