Why I'm asking: this is a favorite question with interviewers, so I tried to find the answer, but so far nothing available on the internet has helped.
Requirement: I have two files in the format below, and I want to process them (using RDDs only) to find, for each age, the number of married, single, and divorced people who have subscribed.
File 1:
age subscribed
58 no
44 no
33 no
58 yes
File 2:
age job marital education default balance
58 management married tertiary no 2143
44 technician single secondary no 29
33 entrepreneur married secondary no 2
58 management single tertiary no 1387
Here is a sample of the expected output:
58 married 0
58 single 1
Answer 0 (score: 0)
You can achieve your requirement in the following way.
First read both files, drop the header lines (otherwise the row "age subscribed ..." would be joined like a data row), and parse each into an RDD:
val rdd1 = sparkContext
  .textFile("path to first csv")
  .filter(!_.startsWith("age")) // skip the header line
  .map(_.split(" ").toList)
  .map(list => (list.head, list.tail))
val rdd2 = sparkContext
  .textFile("path to second csv")
  .filter(!_.startsWith("age")) // skip the header line
  .map(_.split(" ").toList)
  .map(list => (list.head, list.tail))
Both are now pair RDDs keyed on age, so they can be joined, grouped by age, and aggregated:
rdd1.join(rdd2).groupBy(_._1).mapValues(values => {
val subsCount = values.map(x => counts(0, 0, x._2._1(0), x._2._2(1)))
(subsCount.map(_._1).sum, subsCount.map(_._2).sum)
}).flatMap(x => Array((x._1, "married", x._2._1), (x._1, "single", x._2._2)))
.foreach(println)
where counts is the function:
def counts(marriedSubs: Int, singleSubs: Int, subs: String, marital: String): (Int, Int) =
  (subs.toLowerCase, marital.toLowerCase) match {
    case ("yes", "married") => (marriedSubs + 1, singleSubs)
    case ("yes", "single")  => (marriedSubs, singleSubs + 1)
    case _                  => (marriedSubs, singleSubs)
  }
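To see how counts accumulates per age without spinning up Spark, here is a plain-Scala sketch: the rows Seq hard-codes what the joined RDD would contain (age, subscribed, marital) for the question's sample data, and groupBy/foldLeft on ordinary collections stand in for the RDD groupBy and aggregation.

```scala
// Same counts function as in the answer above.
def counts(marriedSubs: Int, singleSubs: Int, subs: String, marital: String): (Int, Int) =
  (subs.toLowerCase, marital.toLowerCase) match {
    case ("yes", "married") => (marriedSubs + 1, singleSubs)
    case ("yes", "single")  => (marriedSubs, singleSubs + 1)
    case _                  => (marriedSubs, singleSubs)
  }

// (age, subscribed, marital) rows as they would look after the join.
val rows = Seq(
  ("58", "no",  "married"),
  ("44", "no",  "single"),
  ("33", "no",  "married"),
  ("58", "yes", "single")
)

// Group by age, then fold counts over each group, threading the totals through.
val perAge: Map[String, (Int, Int)] = rows
  .groupBy(_._1)
  .map { case (age, rs) =>
    age -> rs.foldLeft((0, 0))((acc, r) => counts(acc._1, acc._2, r._2, r._3))
  }

println(perAge("58"))  // (0,1): the only subscriber aged 58 is single
```

The fold makes explicit why counts takes the running totals as its first two arguments: each row either increments one of them or passes them through unchanged.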
You should get the following output:
(58,married,1)
(58,single,1)
(33,married,0)
(33,single,0)
(44,married,0)
(44,single,0)
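Note that the question also asks for a "divorced" count, which the answer above does not track. A minimal sketch of an extension (the name counts3 is hypothetical, following the same pattern) widens the accumulator to a triple and adds one case; the flatMap at the end would then emit a third tuple per age.

```scala
// Hypothetical extension of `counts` that also tracks "divorced" subscribers.
def counts3(married: Int, single: Int, divorced: Int,
            subs: String, marital: String): (Int, Int, Int) =
  (subs.toLowerCase, marital.toLowerCase) match {
    case ("yes", "married")  => (married + 1, single, divorced)
    case ("yes", "single")   => (married, single + 1, divorced)
    case ("yes", "divorced") => (married, single, divorced + 1)
    case _                   => (married, single, divorced)  // non-subscribers count nowhere
  }

println(counts3(0, 0, 0, "yes", "divorced"))  // (0,0,1)
```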