How do I map one RDD onto another RDD using PySpark?

Date: 2018-03-19 15:49:32

Tags: apache-spark pyspark rdd

I have rdd1 with labels (0, 1, 4), and I have another rdd2 with text. I want to map rdd1 onto rdd2 so that row1 of rdd1 is paired with row1 of rdd2, and so on.

I tried:

rdd2.join(rdd1.map(lambda x: (x[0], x[0:])))

but it gives me an error. Can someone guide me? Sample output: rdd1 - labels & rdd2 - text.

1 Answer:

Answer 0 (score: 0):

If you have rdd1 as

val rdd1 = sc.parallelize(List(0,0,4,1,4,1))

and rdd2 as

val rdd2 = sc.parallelize(List("i hate painting i have white paint all over my hands.",
  "Bawww I need a haircut  No1 could fit me in before work tonight. Sigh.",
  "I had a great day",
  "what is life.",
  "He sings so good",
  "i need to go to sleep  ....goodnight"))
and you want to map rdd1 with rdd2 such that row1 of rdd1 is paired with row1 of rdd2, and so on, then:

Using the zip function

A simple zip function should meet your requirement:

rdd1.zip(rdd2)

which will output:

(0,i hate painting i have white paint all over my hands.)
(0,Bawww I need a haircut  No1 could fit me in before work tonight. Sigh.)
(4,I had a great day)
(1,what is life.)
(4,He sings so good)
(1,i need to go to sleep  ....goodnight)

zipWithIndex and join

This approach gives you the same output as zip above, but it is also more expensive, since the join shuffles data:

rdd1.zipWithIndex().map(_.swap).join(rdd2.zipWithIndex().map(_.swap)).map(_._2)

I hope the answer is helpful.