I have an RDD like:
("G1", 1200, List((111, 222, 0, "B"), (555, 666, 0, "F"), (777, 888, 0, "B")))
I want to transform it into:
Transformation 1:
("G1", 1200, "111.222.0|555.666.0|777.888.0")
Transformation 2:
("G1", 1200, 111, 222, 0, "B")
("G1", 1200, 555, 666, 0, "F")
("G1", 1200, 777, 888, 0, "B")
The two transformations are independent of each other.
Answer 0 (score: 0)
You can do the following.
Assuming your RDD is defined as
scala> val rdd = sc.parallelize(Seq(("G1", 1200, List((111, 222, 0, "B"), (555, 666, 0, "F"), (777, 888, 0, "B")))))
rdd: org.apache.spark.rdd.RDD[(String, Int, List[(Int, Int, Int, String)])] = ParallelCollectionRDD[0] at parallelize at <console>:24
The first transformation is
scala> val t1 = rdd.map(row => (row._1, row._2, row._3.map(t => s"${t._1}.${t._2}.${t._3}").mkString("|")))
t1: org.apache.spark.rdd.RDD[(String, Int, String)] = MapPartitionsRDD[1] at map at <console>:26
scala> t1.foreach(println)
(G1,1200,111.222.0|555.666.0|777.888.0)
The second transformation can be
scala> val t2 = rdd.map(row => row._3.map(x => (row._1, row._2, x._1, x._2, x._3, x._4))).flatMap(x => x)
t2: org.apache.spark.rdd.RDD[(String, Int, Int, Int, Int, String)] = MapPartitionsRDD[3] at flatMap at <console>:26
scala> t2.foreach(println)
(G1,1200,111,222,0,B)
(G1,1200,555,666,0,F)
(G1,1200,777,888,0,B)
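The map followed by flatMap(x => x) used for t2 can also be written as a single flatMap. The following is only a minimal sketch on plain Scala collections: a local Seq stands in for the RDD so it runs without a SparkContext, and the names FlattenSketch, Item, Record and explode are hypothetical, introduced here for illustration.

```scala
object FlattenSketch {
  // Hypothetical aliases for the record shapes used in this thread.
  type Item = (Int, Int, Int, String)
  type Record = (String, Int, List[Item])

  // Assumption: a local Seq stands in for the RDD, so no SparkContext is needed.
  val data: Seq[Record] = Seq(
    ("G1", 1200, List((111, 222, 0, "B"), (555, 666, 0, "F"), (777, 888, 0, "B")))
  )

  // Hypothetical helper: one output row per list element, with the key fields repeated.
  def explode(r: Record): List[(String, Int, Int, Int, Int, String)] =
    r._3.map { case (a, b, c, d) => (r._1, r._2, a, b, c, d) }

  // Two steps: map each record to a list of rows, then flatten the lists.
  val twoStep = data.map(explode).flatMap(x => x)

  // One step: flatMap maps and flattens at once.
  val oneStep = data.flatMap(explode)
}
```

Both values hold the same three rows. On a real RDD the same helper could be passed directly, as in rdd.flatMap(FlattenSketch.explode), since RDD.flatMap accepts any function that returns a collection.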
Answer 1 (score: 0)
Here is one way to get what you need:
val rdd = sc.parallelize(Seq(
  ("G1", 1200, List((111, 222, 0, "B"), (555, 666, 0, "F"), (777, 888, 0, "B"))),
  ("G2", 2400, List((222, 444, 0, "A"), (444, 666, 0, "C"), (777, 999, 0, "C")))
))
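// Transformation 1: join the first three fields of each tuple with ".", then join the tuples with "|"
val rdd1 = rdd.map { case (x, y, z) =>
  (x, y, z.map(a => Seq(a._1, a._2, a._3).mkString(".")).mkString("|"))
}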
rdd1.collect
// res1: Array[(String, Int, String)] = Array(
// (G1,1200,111.222.0|555.666.0|777.888.0),
// (G2,2400,222.444.0|444.666.0|777.999.0)
// )
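// Transformation 2: key each list by (x, y), emit one pair per list element, then flatten into a 6-tuple
val rdd2 = rdd.map { case (x, y, z) => ((x, y), z) }.
  flatMapValues(identity).
  map { case ((x, y), z) => (x, y, z._1, z._2, z._3, z._4) }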
rdd2.collect
// res2: Array[(String, Int, Int, Int, Int, String)] = Array(
// (G1,1200,111,222,0,B), (G1,1200,555,666,0,F), (G1,1200,777,888,0,B),
// (G2,2400,222,444,0,A), (G2,2400,444,666,0,C), (G2,2400,777,999,0,C)
// )
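The flatMapValues(identity) step in rdd2 pairs every element of the list with its unchanged (x, y) key. As a rough illustration of that behaviour, the sketch below emulates it on a local Seq. The names FlatMapValuesSketch and flatMapValuesLocal are hypothetical stand-ins, not part of the Spark API, and the sample data is a shortened version of the input above.

```scala
object FlatMapValuesSketch {
  // Hypothetical local stand-in for flatMapValues on a pair RDD:
  // every element produced from a value is paired with that value's unchanged key.
  def flatMapValuesLocal[K, V, U](pairs: Seq[(K, V)])(f: V => Iterable[U]): Seq[(K, U)] =
    pairs.flatMap { case (k, v) => f(v).map(u => (k, u)) }

  // Records already keyed by (x, y), as after the first map in rdd2.
  val keyed = Seq(
    (("G1", 1200), List((111, 222, 0, "B"), (555, 666, 0, "F"))),
    (("G2", 2400), List((222, 444, 0, "A")))
  )

  // One (key, element) pair per list element.
  val exploded = flatMapValuesLocal(keyed)(identity)

  // Flatten the key and the element into a single 6-tuple.
  val rows = exploded.map { case ((x, y), z) => (x, y, z._1, z._2, z._3, z._4) }
}
```

In Spark itself, flatMapValues also keeps the partitioning of the original pair RDD, which a plain flatMap over the whole record does not guarantee. That only matters when the keyed RDD has a partitioner worth preserving.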