Convert one record into multiple records

Time: 2017-09-28 21:15:42

Tags: scala apache-spark

If the input is in the format

(x1, (a, b, c, List(key1, key2)))
(x2, (a, b, c, List(key3)))

I would like to produce this output:

(key1,(a,b,c,x1))
(key2,(a,b,c,x1))
(key3,(a,b,c,x2))

Here is the code:

var hashtags = joined_d.map(x => (x._1, (x._2._1._1, x._2._2, x._2._1._4, getHashTags(x._2._1._4))))

var hashtags_keys = hashtags.map(x => if(x._2._4.size == 0) (x._1, (x._2._1, x._2._2, x._2._3, 0)) else
x._2._4.map(y => (y, (x._2._1, x._2._2, x._2._3, 1))))

The function getHashTags() returns a list. If the list is not empty, we want to use each element of the list as a new key. How can I solve this?

2 answers:

Answer 0 (score: 1)

With the rdd created as:

val rdd = sc.parallelize(
    Seq(
        ("x1",("a","b","c",List("key1", "key2"))), 
        ("x2", ("a", "b", "c", List("key3")))
    )
)

You can use flatMap like this:

rdd.flatMap{ case (x, (a, b, c, list)) => list.map(k => (k, (a, b, c, x))) }.collect
// res12: Array[(String, (String, String, String, String))] = 
//        Array((key1,(a,b,c,x1)), 
//              (key2,(a,b,c,x1)), 
//              (key3,(a,b,c,x2)))

Answer 1 (score: 1)

Here is one approach:

val rdd = sc.parallelize(Seq(
  ("x1", ("a", "b", "c", List("key1", "key2"))),
  ("x2", ("a", "b", "c", List("key3")))
))

val rdd2 = rdd.flatMap{
  case (x, (a, b, c, l)) => l.map( (_, (a, b, c, x) ) )
}

rdd2.collect
// res1: Array[(String, (String, String, String, String))] = Array((key1,(a,b,c,x1)), (key2,(a,b,c,x1)), (key3,(a,b,c,x2)))
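Neither answer keeps the records whose list is empty, which the question's `size == 0` branch was apparently trying to handle. A minimal sketch of one way to do that, written on plain Scala collections since `flatMap` behaves the same on an RDD (the `"x3"` record and the 0/1 flag are assumptions mirroring the question's code, not part of the original data):

```scala
// Plain Scala List standing in for the RDD; rdd.flatMap works identically.
val data = List(
  ("x1", ("a", "b", "c", List("key1", "key2"))),
  ("x2", ("a", "b", "c", List("key3"))),
  ("x3", ("a", "b", "c", List.empty[String])) // hypothetical record with no hashtags
)

// Emit one record per key; when the list is empty, fall back to the
// original key x with a 0 flag, as in the question's size == 0 branch.
val out = data.flatMap {
  case (x, (a, b, c, Nil))  => List((x, (a, b, c, 0)))
  case (x, (a, b, c, keys)) => keys.map(k => (k, (a, b, c, 1)))
}
// out: List((key1,(a,b,c,1)), (key2,(a,b,c,1)), (key3,(a,b,c,1)), (x3,(a,b,c,0)))
```

Because both branches return a collection of the same `(String, (String, String, String, Int))` type, the type-mismatch problem in the question's `if`/`else` (one branch returning a tuple, the other a list) disappears.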