Transforming an RDD in Scala

Date: 2017-07-12 10:09:56

Tags: scala rdd

I have an RDD whose rows look like this:

("G1", 1200, List((111, 222, 0, "B"), (555, 666, 0, "F"), (777, 888, 0, "B")))

I want to transform it into:

Transformation 1:

("G1", 1200, "111.222.0|555.666.0|777.888.0")

Transformation 2:

("G1", 1200, 111, 222, 0, "B")
("G1", 1200, 555, 666, 0, "F")
("G1", 1200, 777, 888, 0, "B")

Treat the two transformations as independent of each other.

2 answers:

Answer 0: (score: 0)

You can do the following.

Assuming you have the rdd defined as

scala> val rdd = sc.parallelize(Seq(("G1", 1200, List((111, 222, 0, "B"), (555, 666, 0, "F"), (777, 888, 0, "B")))))
rdd: org.apache.spark.rdd.RDD[(String, Int, List[(Int, Int, Int, String)])] = ParallelCollectionRDD[0] at parallelize at <console>:24

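The first transformation can be written as follows. Note that it relies on the string form of each tuple, so the fields come out comma-separated with the letter included, not in the dotted 111.222.0 form from the question: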

scala> val t1 = rdd.map(row => (row._1, row._2, row._3.mkString("|").replace("(", "").replace(")", "")))
t1: org.apache.spark.rdd.RDD[(String, Int, String)] = MapPartitionsRDD[1] at map at <console>:26

scala> t1.foreach(println)
(G1,1200,111,222,0,B|555,666,0,F|777,888,0,B)
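To see why the output above has commas and the trailing letter, and what a version matching the question's dotted format might look like, here is a minimal sketch. It uses a plain Scala Seq in place of the RDD so it can be pasted into any Scala REPL; the assumption is that the same function literals can be passed to rdd.map unchanged. The names rows, viaToString, dotted, key and num are illustrative and not part of the original answer.

```scala
// A plain Seq stands in for the RDD; the closures passed to map are the
// same ones you would pass to rdd.map in spark-shell.
val rows = Seq(
  ("G1", 1200, List((111, 222, 0, "B"), (555, 666, 0, "F"), (777, 888, 0, "B")))
)

// mkString calls toString on each tuple, which renders "(111,222,0,B)".
// Stripping the parentheses therefore leaves commas and the letter in place.
val viaToString = rows.map(row =>
  (row._1, row._2, row._3.mkString("|").replace("(", "").replace(")", "")))

// Building each piece explicitly gives the dotted form and drops the letter.
val dotted = rows.map { case (key, num, items) =>
  (key, num, items.map { case (a, b, c, _) => s"$a.$b.$c" }.mkString("|"))
}

println(viaToString.head._3)
println(dotted.head._3)
```

The pattern match on the inner tuple also avoids depending on how tuples are printed, which would matter if any field were itself a string containing parentheses.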

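Your second transformation can be written as follows. Note that the inner tuple stays nested in each result row instead of being flattened into six fields: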

scala> val t2 = rdd.map(row => row._3.map(x => (row._1, row._2, x))).flatMap(x => x)
t2: org.apache.spark.rdd.RDD[(String, Int, (Int, Int, Int, String))] = MapPartitionsRDD[3] at flatMap at <console>:26

scala> t2.foreach(println)
(G1,1200,(111,222,0,B))
(G1,1200,(555,666,0,F))
(G1,1200,(777,888,0,B))
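If the flat six-field rows from the question are needed, one option is to unpack the inner tuple while expanding the list. The sketch below again uses a plain Scala Seq as a stand-in for the RDD, on the assumption that the same function can be passed to rdd.flatMap; the names rows, flat, key and num are illustrative only.

```scala
// A plain Seq stands in for the RDD; the closure passed to flatMap is the
// same one you would pass to rdd.flatMap in spark-shell.
val rows = Seq(
  ("G1", 1200, List((111, 222, 0, "B"), (555, 666, 0, "F"), (777, 888, 0, "B")))
)

// One flatMap emits a flat 6-tuple per list element, replacing the
// map-then-flatMap(x => x) pair and the nested inner tuple.
val flat = rows.flatMap { case (key, num, items) =>
  items.map { case (a, b, c, d) => (key, num, a, b, c, d) }
}

flat.foreach(println)
```

Each element of flat has type (String, Int, Int, Int, Int, String), which matches the shape shown under Transformation 2.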

Answer 1: (score: 0)

Here is one way to get what you need:

val rdd = sc.parallelize(Seq(
  ("G1", 1200, List((111, 222, 0, "B"), (555, 666, 0, "F"), (777, 888, 0, "B"))),
  ("G2", 2400, List((222, 444, 0, "A"), (444, 666, 0, "C"), (777, 999, 0, "C")))
))

val rdd1 = rdd.map{ case (x, y, z) => (x, y, z.map(
    a => Seq(a._1, a._2, a._3).mkString(".")
  ).mkString("|")
) }

rdd1.collect
// res1: Array[(String, Int, String)] = Array(
//   (G1,1200,111.222.0|555.666.0|777.888.0),
//   (G2,2400,222.444.0|444.666.0|777.999.0)
// )

val rdd2 = rdd.map{ case (x, y, z) => ((x, y), z) }.
  flatMapValues(identity).
  map { case ((x, y), z) => (x, y, z._1, z._2, z._3, z._4) }

rdd2.collect
// res2: Array[(String, Int, Int, Int, Int, String)] = Array(
//   (G1,1200,111,222,0,B), (G1,1200,555,666,0,F), (G1,1200,777,888,0,B),
//   (G2,2400,222,444,0,A), (G2,2400,444,666,0,C), (G2,2400,777,999,0,C)
// )