I can flatMap the second element of an RDD:
val rdd = sc.parallelize(Seq(
  (1, "Hello how are you"),
  (1, "I am fine"),
  (2, "Yes you are")
))
val rdd2 = rdd.flatMap(x => x._2.split(" "))
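However, I want to attach x._1 to every split item of x._2 right away, so that each result is a tuple (String, Int). For some reason I can't see how to do it, and I would rather not convert to a DataFrame with an array column and explode it. Any ideas?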
Answer 0 (score: 1)
Just iterate over the array (the result of the split) and append the value you need:
val rdd = sc.parallelize(Seq(
  (1, "Hello how are you"),
  (1, "I am fine"),
  (2, "Yes you are")
))
// Each element of rdd2 is a String such as "Hello+1"
val rdd2 = rdd.flatMap(x => x._2.split(" ").map(item => s"${item}+${x._1}"))
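The snippet above concatenates the key into a string, while the question asks for (String, Int) tuples. A minimal sketch of the tuple variant follows. It assumes a plain Scala Seq as a stand-in for the RDD so it runs without a SparkContext; the names AttachKey, data, and attachKey are hypothetical and only for illustration. The flatMap/map shape is the same one you would pass to rdd.flatMap.

```scala
// Plain Scala sketch: a Seq stands in for the RDD (assumption, so no Spark is required).
object AttachKey {
  val data: Seq[(Int, String)] = Seq(
    (1, "Hello how are you"),
    (1, "I am fine"),
    (2, "Yes you are")
  )

  // For each (key, sentence) pair, split the sentence on spaces
  // and pair every resulting word with the key.
  def attachKey(rows: Seq[(Int, String)]): Seq[(String, Int)] =
    rows.flatMap { case (key, sentence) =>
      sentence.split(" ").map(word => (word, key))
    }

  def main(args: Array[String]): Unit =
    attachKey(data).foreach(println)
}
```

On the actual RDD the equivalent would presumably be rdd.flatMap(x => x._2.split(" ").map(item => (item, x._1))). If the (Int, String) order is acceptable, rdd.flatMapValues(_.split(" ")) should give the same pairs with the key first.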
Answer 1 (score: 1)
You can get the same result with the DataFrame abstraction as well. Check this out:
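import spark.implicits._ // for toDF and the tuple encoder; assumes `spark` is the SparkSession (automatic in spark-shell)
val df = Seq((1, "Hello how are you"), (1, "I am fine"), (2, "Yes you are")).toDF("a", "b")
df.show(false)
df.flatMap { r =>
  val y = r.getString(1).split(" ")
  (0 until y.size).map(i => (r.getInt(0), y(i))) // pair each word with column a
}.show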
Result:
+---+-----------------+
|a  |b                |
+---+-----------------+
|1  |Hello how are you|
|1  |I am fine        |
|2  |Yes you are      |
+---+-----------------+
+---+-----+
| _1|   _2|
+---+-----+
|  1|Hello|
|  1|  how|
|  1|  are|
|  1|  you|
|  1|    I|
|  1|   am|
|  1| fine|
|  2|  Yes|
|  2|  you|
|  2|  are|
+---+-----+