How to process multiple files using RDDs?

Date: 2017-12-14 03:30:29

Tags: scala apache-spark

Why I'm asking: this is a popular interview question, so I tried to find an answer, but so far nothing I could find on the internet has helped.

Requirement: I have two files in the format below, and I want to operate on them (using RDDs only) to find, for each age, the number of married, single, and divorced people who have subscribed.

File 1:

age subscribed
58  no
44  no
33  no
58 yes

File 2:

age job marital education   default balance
58  management  married tertiary    no  2143
44  technician  single  secondary   no  29
33  entrepreneur    married secondary   no  2
58  management  single  tertiary    no  1387

Sample output:

58 married 0
58 single 1

1 answer:

Answer 0 (score: 0)

You can achieve your requirement as follows.

First, read both files and parse them into RDDs:

val rdd1 = sparkContext
  .textFile("path to first csv")
  .map(_.split("\\s+").toList)   // split on runs of whitespace so aligned columns parse cleanly
  .map(list => (list.head, list.tail))
val rdd2 = sparkContext
  .textFile("path to second csv")
  .map(_.split("\\s+").toList)
  .map(list => (list.head, list.tail))
// note: if the files include the header row shown above, filter it out first
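Note that `split(" ")` leaves empty tokens when columns are aligned with more than one space, as in the sample rows above; splitting on the regex `\\s+` avoids this. A quick plain-Scala check (no Spark needed):

```scala
object SplitCheck {
  // "58  no" has two spaces between the columns, as in the aligned sample data
  val naive  = "58  no".split(" ").toList     // empty token left between the spaces
  val robust = "58  no".split("\\s+").toList  // clean two-column split
}
```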

Then join the two pair RDDs on age and aggregate:

rdd1.join(rdd2)                   // inner join on age; duplicate keys yield a cross product
  .groupBy(_._1)                  // group the joined records by age
  .mapValues(values => {
    // counts returns (1,0) / (0,1) for a subscribed married / single person, else (0,0)
    val subsCount = values.map(x => counts(0, 0, x._2._1(0), x._2._2(1)))
    (subsCount.map(_._1).sum, subsCount.map(_._2).sum)
  })
  .flatMap(x => Array((x._1, "married", x._2._1), (x._1, "single", x._2._2)))
  .foreach(println)

where the counts function is defined as:

def counts(marriedSubs: Int, singleSubs: Int, subs: String, marital: String): (Int, Int) =
  (subs.toLowerCase, marital.toLowerCase) match {
    case ("yes", "married") => (marriedSubs + 1, singleSubs)
    case ("yes", "single")  => (marriedSubs, singleSubs + 1)
    case _                  => (marriedSubs, singleSubs)
  }

You should get the following output:

(58,married,1)
(58,single,1)
(33,married,0)
(33,single,0)
(44,married,0)
(44,single,0)
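To sanity-check the pipeline without a Spark cluster, the same logic can be replayed on plain Scala collections using the question's sample rows. This is a sketch, not the answer's code: `SparkContext` and file paths are omitted, the join is simulated with a for-comprehension (duplicate keys produce a cross product, matching `RDD.join` semantics), and the rows are hard-coded from the question.

```scala
object CountsSketch {
  // same counts logic as in the answer above
  def counts(marriedSubs: Int, singleSubs: Int, subs: String, marital: String): (Int, Int) =
    (subs.toLowerCase, marital.toLowerCase) match {
      case ("yes", "married") => (marriedSubs + 1, singleSubs)
      case ("yes", "single")  => (marriedSubs, singleSubs + 1)
      case _                  => (marriedSubs, singleSubs)
    }

  def run(): Seq[(String, String, Int)] = {
    // (age, rest-of-row) pairs, as produced by the head/tail parsing step
    val rdd1 = Seq(
      ("58", List("no")), ("44", List("no")), ("33", List("no")), ("58", List("yes")))
    val rdd2 = Seq(
      ("58", List("management", "married", "tertiary", "no", "2143")),
      ("44", List("technician", "single", "secondary", "no", "29")),
      ("33", List("entrepreneur", "married", "secondary", "no", "2")),
      ("58", List("management", "single", "tertiary", "no", "1387")))

    // inner join on age; duplicate keys yield a cross product, like RDD.join
    val joined = for {
      (k1, v1) <- rdd1
      (k2, v2) <- rdd2
      if k1 == k2
    } yield (k1, (v1, v2))

    joined.groupBy(_._1).toSeq.flatMap { case (age, values) =>
      val subsCount = values.map(x => counts(0, 0, x._2._1(0), x._2._2(1)))
      val (married, single) = (subsCount.map(_._1).sum, subsCount.map(_._2).sum)
      Seq((age, "married", married), (age, "single", single))
    }
  }
}
```

Running this reproduces the six tuples shown above (in some order), which also illustrates why age 58 counts one subscribed "married" and one subscribed "single": the duplicate 58 keys join into four combinations, and only the two containing the "yes" row contribute.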