Split the string of an RDD and combine it with the other RDD element in a single statement

Asked: 2019-01-10 15:00:12

Tags: apache-spark rdd

I can flatMap the second element of the RDD:

val rdd = sc.parallelize( Seq( (1, "Hello how are you"),
                               (1, "I am fine"),
                               (2, "Yes you are")
                             )
                        )
val rdd2 = rdd.flatMap(x => x._2.split(" "))

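However, I want to attach x._1 to every split item of x._2 right away, so that each item becomes a tuple (String, Int). For some reason I cannot see how to do it, and I do not want to convert to a DataFrame with an array column and explode it. Any ideas?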

2 Answers:

Answer 0 (score: 1)

Just iterate over the array (the result of the split) and attach the value you need to each item:

val rdd = sc.parallelize( Seq( (1, "Hello how are you"),
                               (1, "I am fine"),
                               (2, "Yes you are")
                             )
                        )
val rdd2 = rdd.flatMap(x => x._2.split(" ").map(item => s"${item}+${x._1}"))
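
Note that the snippet above yields strings such as "Hello+1" rather than the (String, Int) tuple the question asks for. The following is a minimal sketch of a tuple-producing variant, not a definitive implementation. It assumes a local SparkSession (in spark-shell, `sc` already exists), and the names `SplitWithKey` and `tagWords` are hypothetical, chosen only for illustration. It also shows `flatMapValues`, which keeps the key and splits only the value.

```scala
import org.apache.spark.sql.SparkSession

object SplitWithKey {
  // Hypothetical helper: pairs every word of the text with the row's key.
  def tagWords(pair: (Int, String)): Seq[(String, Int)] =
    pair._2.split(" ").toSeq.map(word => (word, pair._1))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration purposes only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-with-key")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(Seq(
      (1, "Hello how are you"),
      (1, "I am fine"),
      (2, "Yes you are")
    ))

    // Variant 1: build the tuple inside flatMap, giving RDD[(String, Int)].
    val pairs = rdd.flatMap(p => tagWords(p))

    // Variant 2: flatMapValues yields RDD[(Int, String)];
    // swap each pair to get (String, Int).
    val viaValues = rdd.flatMapValues(_.split(" ")).map(_.swap)

    pairs.collect().foreach(println)
    viaValues.collect().foreach(println)

    spark.stop()
  }
}
```

Both variants stay on the RDD API, so no DataFrame conversion or explode is involved.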

Answer 1 (score: 1)

You can also get the same result with the DataFrame abstraction. Check this out:

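  import spark.implicits._  // provides toDF and the tuple encoder; already in scope in spark-shell
  val df = Seq((1, "Hello how are you"), (1, "I am fine"), (2, "Yes you are")).toDF("a", "b")
  df.show(false)
  // Split column b on spaces and pair each word with the value of column a
  df.flatMap(r => r.getString(1).split(" ").map(word => (r.getInt(0), word))).show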

Result:

+---+-----------------+
|a  |b                |
+---+-----------------+
|1  |Hello how are you|
|1  |I am fine        |
|2  |Yes you are      |
+---+-----------------+

+---+-----+
| _1|   _2|
+---+-----+
|  1|Hello|
|  1|  how|
|  1|  are|
|  1|  you|
|  1|    I|
|  1|   am|
|  1| fine|
|  2|  Yes|
|  2|  you|
|  2|  are|
+---+-----+