How to process multiple files using RDDs?

Date: 2017-12-14 03:30:29

Tags: scala apache-spark

Why I'm asking: this is a popular interview question, so I tried to find an answer, but so far nothing I could find on the internet has helped.

Requirement: I have two files in the format below, and I want to operate on them (using RDDs only) to find, for each age, the number of married, single, and divorced people who have subscribed.

File 1:

age subscribed
58  no
44  no
33  no
58 yes

File 2:

age job marital education   default balance
58  management  married tertiary    no  2143
44  technician  single  secondary   no  29
33  entrepreneur    married secondary   no  2
58  management  single  tertiary    no  1387

Sample output:

58 married 0
58 single 1

1 answer:

Answer 0 (score: 0)

You can achieve your requirement as follows.

First, read both files and parse them into RDDs:

val rdd1 = sparkContext
  .textFile("path to first csv")
  .map(_.split("\\s+").toList)   // split on runs of whitespace so aligned columns parse cleanly
  .map(list => (list.head, list.tail))
val rdd2 = sparkContext
  .textFile("path to second csv")
  .map(_.split("\\s+").toList)
  .map(list => (list.head, list.tail))
// note: if the files include the header row shown above, filter it out first
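Note that `split(" ")` leaves empty tokens when columns are aligned with more than one space, as in the sample rows above; splitting on the regex `\\s+` avoids this. A quick plain-Scala check (no Spark needed):

```scala
object SplitCheck {
  // "58  no" has two spaces between the columns, as in the aligned sample data
  val naive  = "58  no".split(" ").toList     // empty token left between the spaces
  val robust = "58  no".split("\\s+").toList  // clean two-column split
}
```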

Then join the two pair RDDs on age and aggregate:

rdd1.join(rdd2)                   // inner join on age; duplicate keys yield a cross product
  .groupBy(_._1)                  // group the joined records by age
  .mapValues(values => {
    // counts returns (1,0) / (0,1) for a subscribed married / single person, else (0,0)
    val subsCount = values.map(x => counts(0, 0, x._2._1(0), x._2._2(1)))
    (subsCount.map(_._1).sum, subsCount.map(_._2).sum)
  })
  .flatMap(x => Array((x._1, "married", x._2._1), (x._1, "single", x._2._2)))
  .foreach(println)

where the counts function is defined as:

def counts(marriedSubs: Int, singleSubs: Int, subs: String, marital: String): (Int, Int) =
  (subs.toLowerCase, marital.toLowerCase) match {
    case ("yes", "married") => (marriedSubs + 1, singleSubs)
    case ("yes", "single")  => (marriedSubs, singleSubs + 1)
    case _                  => (marriedSubs, singleSubs)
  }

You should get the following output:

(58,married,1)
(58,single,1)
(33,married,0)
(33,single,0)
(44,married,0)
(44,single,0)
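To sanity-check the pipeline without a Spark cluster, the same logic can be replayed on plain Scala collections using the question's sample rows. This is a sketch, not the answer's code: `SparkContext` and file paths are omitted, the join is simulated with a for-comprehension (duplicate keys produce a cross product, matching `RDD.join` semantics), and the rows are hard-coded from the question.

```scala
object CountsSketch {
  // same counts logic as in the answer above
  def counts(marriedSubs: Int, singleSubs: Int, subs: String, marital: String): (Int, Int) =
    (subs.toLowerCase, marital.toLowerCase) match {
      case ("yes", "married") => (marriedSubs + 1, singleSubs)
      case ("yes", "single")  => (marriedSubs, singleSubs + 1)
      case _                  => (marriedSubs, singleSubs)
    }

  def run(): Seq[(String, String, Int)] = {
    // (age, rest-of-row) pairs, as produced by the head/tail parsing step
    val rdd1 = Seq(
      ("58", List("no")), ("44", List("no")), ("33", List("no")), ("58", List("yes")))
    val rdd2 = Seq(
      ("58", List("management", "married", "tertiary", "no", "2143")),
      ("44", List("technician", "single", "secondary", "no", "29")),
      ("33", List("entrepreneur", "married", "secondary", "no", "2")),
      ("58", List("management", "single", "tertiary", "no", "1387")))

    // inner join on age; duplicate keys yield a cross product, like RDD.join
    val joined = for {
      (k1, v1) <- rdd1
      (k2, v2) <- rdd2
      if k1 == k2
    } yield (k1, (v1, v2))

    joined.groupBy(_._1).toSeq.flatMap { case (age, values) =>
      val subsCount = values.map(x => counts(0, 0, x._2._1(0), x._2._2(1)))
      val (married, single) = (subsCount.map(_._1).sum, subsCount.map(_._2).sum)
      Seq((age, "married", married), (age, "single", single))
    }
  }
}
```

Running this reproduces the six tuples shown above (in some order), which also illustrates why age 58 counts one subscribed "married" and one subscribed "single": the duplicate 58 keys join into four combinations, and only the two containing the "yes" row contribute.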