Flatten an array of tuples in a Spark RDD

Date: 2017-04-29 17:09:08

Tags: scala apache-spark

I have a pair RDD of type

Array[((String, String), ((String, String, String, String, String), (Double, Double)))]

For example:

scala> joinWD.collect

res75: Array[((String, String), ((String, String, String, String, String), (Double, Double)))] = Array(((82010200-01,2008),((Acorn Lake,Washington,Lower St. Croix River,-92.97171054,45.01655642),(1.0413333177566528,0.04000000283122063))), 
((82010200-01,2008),((Acorn Lake,Washington,Lower St. Croix River,-92.97171054,45.01655642),(1.0413333177566528,0.04000000283122063)))]

I want to flatten it into an Array[((String, String), String, String, String, String, String, Double, Double)]. The first tuple is the key, and all the other elements are the values.

How can we flatten it using Spark / Scala?

1 answer:

Answer 0 (score: 1)

As far as I know, there is no way to flatten tuples (unless you use shapeless), so the map may not look too pretty:

val myArr: Array[((String, String), ((String, String, String, String, String), (Double, Double)))] = Array(
  (("82010200-01", "2008"), (("Acorn Lake", "Washington", "Lower St. Croix River", "-92.97171054", "45.01655642"), (1.0413333177566528, 0.04000000283122063))),
  (("82010200-01", "2008"), (("Acorn Lake", "Washington", "Lower St. Croix River", "-92.97171054", "45.01655642"), (1.0413333177566528, 0.04000000283122063)))
)

myArr.map{ case (k, (u, v)) => (k, u._1, u._2, u._3, u._4, u._5, v._1, v._2) }

res1: Array[((String, String), String, String, String, String, String, Double, Double)] = Array(
  ((82010200-01, 2008), Acorn Lake, Washington, Lower St. Croix River, -92.97171054, 45.01655642, 1.0413333177566528, 0.04000000283122063),
  ((82010200-01, 2008), Acorn Lake, Washington, Lower St. Croix River, -92.97171054, 45.01655642, 1.0413333177566528, 0.04000000283122063)
)
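A variant of the same idea: instead of the positional `._1` … `._5` accessors, every field can be bound by name in a single pattern match, which reads left to right. This is a sketch; the names (`id`, `year`, `name`, `state`, `basin`, `lon`, `lat`, `d1`, `d2`) are illustrative, not from the original data. On an actual RDD the identical `map` applies unchanged, since `RDD.map` takes the same function.

```scala
// Sample data with the same shape as the question's pair RDD
// (one row is enough to illustrate the flatten).
val rows: Array[((String, String), ((String, String, String, String, String), (Double, Double)))] = Array(
  (("82010200-01", "2008"),
   (("Acorn Lake", "Washington", "Lower St. Croix River", "-92.97171054", "45.01655642"),
    (1.0413333177566528, 0.04000000283122063)))
)

// Destructure every field in one pattern match; field names here are
// hypothetical labels chosen only to make the flattened row readable.
val flattened = rows.map {
  case ((id, year), ((name, state, basin, lon, lat), (d1, d2))) =>
    ((id, year), name, state, basin, lon, lat, d1, d2)
}
```

The result type is the same `Array[((String, String), String, String, String, String, String, Double, Double)]` as in the answer above; the pattern-match form just avoids counting tuple positions by hand.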