I am trying to write a Spark join using text data files, but my join does not work the way I expect.
val sc = new SparkContext("local[*]", "employeedata")
val empoyees= sc.textFile("../somewhere/employee.data")
val reputations= sc.textFile("../somewhere/reputations.data")
val employeesRdd= empoyees.map(x=> (x.toString().split(",")(0), x))
val reputationsRdd= reputations.map(y=> (y.toString().split(",")(0), y))
val joineddata = employeesRdd.join(reputationsRdd).map(_._2)
employee.data looks like the following:
emp_id,first_name,last_name,age,country,education
reputations.data looks like the following:
emp_id,reputation
But the result I get looks like this:
(emp_id,first_name,last_name,age,country,education,emp_id,reputation)
whereas I need the following output:
(emp_id,first_name,last_name,age,country,education,reputation)
The duplicate emp_id that appears between education and reputation should be removed. Can someone help me?
Answer 0 (score: 0)
Here is some pseudocode (if we're lucky, it might compile and even work!) to give you a hand:
// split each line into its fields and key the records by emp_id
// (you could map the arrays to case classes here instead)
val employeesRdd = empoyees.map(x => x.split(","))
  .keyBy(e => e(0))
val reputationsRdd = reputations.map(y => y.split(","))
  .keyBy(r => r(0))

// join on emp_id, then rebuild the row, dropping the duplicate
// emp_id that comes from the reputation side
val joineddata = employeesRdd.join(reputationsRdd)
  .map { case (key, (Array(empId, firstName, lastName, age, country, education), Array(_, reputation))) =>
    (empId, firstName, lastName, age, country, education, reputation)
  }
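To sanity-check the key-by-id / join / drop-duplicate-id logic without a Spark cluster, you can mimic the same steps with plain Scala collections. This is only an illustration of the join semantics; the sample rows and values below are made up:

```scala
object JoinSketch extends App {
  // Simulate the two input files as lines of CSV (sample data, made up)
  val employeeLines = Seq("1,John,Doe,30,US,BSc", "2,Jane,Roe,28,UK,MSc")
  val reputationLines = Seq("1,95", "2,88")

  // Key each line's fields by the first column (emp_id), as in the RDD version
  val employees = employeeLines.map(_.split(",")).map(e => e(0) -> e).toMap
  val reputations = reputationLines.map(_.split(",")).map(r => r(0) -> r).toMap

  // Join on emp_id and append only the reputation value,
  // so the duplicate emp_id from the reputation side is dropped
  val joined = for {
    (id, emp) <- employees
    rep <- reputations.get(id)
  } yield emp :+ rep(1)

  joined.foreach(row => println(row.mkString(",")))
}
```

Each printed row has the shape you asked for: the six employee fields followed by a single reputation, with no second emp_id in between.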