Pairing the first element of each string array with its remaining elements as (K, V) in Spark

Time: 2017-10-27 16:19:51

Tags: scala apache-spark

My RDD has the form Array[Array[String]] = Array(Array(12345, 1232A, 66QQ2, ASC42, 0003A, 2294AA, AGDT33, 23881), Array(536366, 22633, 22632)....)

I want the output to be:

Array[(String, String)] = Array((12345,1232A), (12345,66QQ2)....

2 answers:

Answer 0: (score: 1)

Try a flatMap transformation that emits the first element of each array paired with each of the remaining elements:

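import scala.collection.mutable.ListBuffer

rdd.flatMap { s =>
  // Collect (first element, other element) pairs for this inner array
  val output = ListBuffer.empty[(String, String)]
  for (i <- 1 until s.length) {
    output += ((s(0), s(i)))
  }
  output
}.foreach(println) // on a cluster this prints on the executors; use collect() to print on the driver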
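
The same pairing can also be written without a mutable buffer. The following is a minimal sketch rather than the answerer's code: pairWithHead is a hypothetical helper name, and a plain Seq stands in for the RDD so that the snippet runs in a Scala REPL without a Spark context. With a real RDD, the helper could presumably be applied as rdd.flatMap(row => pairWithHead(row)).

```scala
// Hypothetical helper (not from the original answer): pairs the first
// element of a row with each of the remaining elements.
def pairWithHead(row: Array[String]): Seq[(String, String)] =
  row.headOption match {
    case Some(key) => row.drop(1).toSeq.map(value => (key, value))
    case None      => Seq.empty // an empty row yields no pairs
  }

// A plain Seq stands in for the RDD so the sketch runs without Spark.
val rows: Seq[Array[String]] = Seq(
  Array("12345", "1232A", "66QQ2"),
  Array("536366", "22633", "22632")
)

val pairs: Seq[(String, String)] = rows.flatMap(pairWithHead)
pairs.foreach(println)
// (12345,1232A)
// (12345,66QQ2)
// (536366,22633)
// (536366,22632)
```

Using headOption means that an empty inner array produces no pairs instead of throwing, and a single-element array also produces none because there is nothing to pair the key with.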

Answer 1: (score: 0)

Try using the RDD's map together with a Stream to zip the head of each inner array with each element of its tail.

val test: Array[Array[String]] = Array(Array("12345", "1232A", "66QQ2", "ASC42", "0003A", "2294AA", "AGDT33", "23881"), Array("536366", "22633", "22632"))
val TestRdd = sc.parallelize(test)
// An RDD has no flatten method, so collect to the driver first and flatten the resulting Array
val finalOutput: Array[(String, String)] = TestRdd.map(xs => (Stream.continually(xs.head) zip xs.tail).toList).collect().flatten

// finalOutput is 
// res8: Array[(String, String)] = Array((12345,1232A), (12345,66QQ2), (12345,ASC42), (12345,0003A), (12345,2294AA), (12345,AGDT33), (12345,23881), (536366,22633), (536366,22632))