Paired RDD from wholeTextFiles

Time: 2017-05-22 14:55:11

Tags: apache-spark

I am trying to use wholeTextFiles and build a paired RDD from the data, but since I am new to this I am a bit confused. Here is the code:

val wholefiles = sc.wholeTextFiles("sqoop_import/orders")
wholefiles: org.apache.spark.rdd.RDD[(String, String)] = sqoop_import/orders MapPartitionsRDD[72] at wholeTextFiles at <console>:27

wholefiles.take(5).foreach(println)
(hdfs://filename, 1, 2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED)

How do I get a paired RDD of column4 and column1 from the above data?

1 Answer:

Answer 0: (score: 1)

You can use the following code -

wholefiles.map(record => record._2)                 // keep only the file content, drop the file path key
          .flatMap(content => content.split("\n"))  // split each file's content into individual lines
          .map(line => line.split(","))              // split each line into its comma-separated fields
          .map(fields => (fields(3), fields(0)))     // pair (column4, column1)
          .collect()

I hope it helps.