Adding values with groupByKey

Date: 2017-12-16 01:10:38

Tags: scala apache-spark dataframe apache-spark-sql

I am having trouble with groupByKey in Scala and Spark. I have 2 case classes:

case class Employee(id_employee: Long, name_emp: String, salary: String)

For now I use this second case class:

case class Company(id_company: Long, employee: Seq[Employee])

But I would like to replace it with this new one:

case class Company(id_company: Long, name_comp: String, employee: Seq[Employee])

I use groupByKey to create the parent Dataset (df1) of Company objects:

val companies = df1.groupByKey(v => v.id_company)
  .mapGroups {
    case (k, iter) => Company(k, iter.map(x => Employee(x.id_employee, x.name_emp, x.salary)).toSeq)
  }
  .collect()

This code works; it returns objects like this one:

Company(1234,List(Employee(0987, John, 30000),Employee(4567, Bob, 50000)))

But I have not found how to add the company name_comp to these objects (the field exists in df1), so that I can retrieve objects like this one (with the new case class):

Company(1234, name_comp, List(Employee(0987, John, 30000), Employee(4567, Bob, 50000)))

1 answer:

Answer 0 (score: 2):

Since you need both the company id and the name, you can use a tuple as the key when grouping the data. That makes both values readily available when constructing the Company objects:

df1.groupByKey(v => (v.id_company, v.name_comp))
  .mapGroups { case ((id, name), iter) =>
    Company(id, name, iter.map(x => Employee(x.id_employee, x.name_emp, x.salary)).toSeq)
  }
  .collect()
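
For reference, here is a minimal, self-contained sketch of the whole flow. It is an illustration rather than the asker's actual setup: CompanyRow is a hypothetical name for the flat row type of df1 (the question only implies its five fields), the sample rows are made up, and the local SparkSession exists just to make the example runnable:

import org.apache.spark.sql.SparkSession

// Hypothetical flat input row; the question implies df1 exposes these five fields.
case class CompanyRow(id_company: Long, name_comp: String,
                      id_employee: Long, name_emp: String, salary: String)
case class Employee(id_employee: Long, name_emp: String, salary: String)
case class Company(id_company: Long, name_comp: String, employee: Seq[Employee])

object GroupByKeyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("groupByKey-tuple-key")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for df1.
    val df1 = Seq(
      CompanyRow(1234L, "Acme", 987L, "John", "30000"),
      CompanyRow(1234L, "Acme", 4567L, "Bob", "50000")
    ).toDS()

    // Group on (id, name) so both values travel in the key,
    // then build one Company per group.
    val companies = df1
      .groupByKey(v => (v.id_company, v.name_comp))
      .mapGroups { case ((id, name), iter) =>
        // toList materializes the iterator eagerly, which is safer
        // inside mapGroups than a lazily evaluated Seq.
        Company(id, name, iter.map(x => Employee(x.id_employee, x.name_emp, x.salary)).toList)
      }
      .collect()

    // Prints one Company per group, along the lines of:
    // Company(1234,Acme,List(Employee(987,John,30000), Employee(4567,Bob,50000)))
    companies.foreach(println)

    spark.stop()
  }
}

Note that grouping on the (id_company, name_comp) pair only yields one group per company because the name is functionally dependent on the id; if the same id could appear with two different spellings of the name, those rows would land in separate groups.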