Select entire rows based on logic applied to 2 columns across multiple rows

Time: 2017-08-30 00:34:28

Tags: scala apache-spark apache-spark-sql spark-structured-streaming

Below is the input DataFrame (the real one is very large):

[image: input DataFrame]

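For each employee, get the latest income whose Income_age timestamp is less than 10 months old. If there is no income data within the last 10 months, take the value from that account's sourcedIncome column instead of the Income column. This is the logic that the income calculation function in the code below is meant to implement.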
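The rule above can be sketched in plain Scala, independent of Spark. This is only an illustration under stated assumptions: `IncomeRecord`, `IncomeRule.pickIncome`, the `asOf` reference date, and the choice to take `sourcedIncome` from the most recent row are all hypothetical and do not come from the question's schema.

```scala
import java.time.LocalDate

// Hypothetical record for one income observation; field names are assumptions.
final case class IncomeRecord(
  employeeId: Int,
  income: Option[Int],      // None when no income was reported on this row
  incomeDate: LocalDate,    // stands in for the Income_age timestamp
  sourcedIncome: Int
)

object IncomeRule {
  // Returns the income to report for one employee's rows:
  //   1. the most recent income dated within `maxAgeMonths` of `asOf`, if any;
  //   2. otherwise the sourcedIncome of the most recent row (an assumption);
  //   3. None when the employee has no rows at all.
  def pickIncome(
      records: Seq[IncomeRecord],
      asOf: LocalDate,
      maxAgeMonths: Int = 10
  ): Option[Int] = {
    val cutoff = asOf.minusMonths(maxAgeMonths.toLong)
    // Newest rows first.
    val byRecency = records.sortBy(r => -r.incomeDate.toEpochDay)
    val recentIncome = byRecency
      .filter(_.incomeDate.isAfter(cutoff))
      .flatMap(_.income)
      .headOption
    recentIncome.orElse(byRecency.headOption.map(_.sourcedIncome))
  }
}
```

Keeping the rule in a pure function like this makes it possible to unit test it separately from whichever Spark operator ends up calling it per employee.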

Expected output DataFrame:

[image: expected output DataFrame]

Here is what I plan to implement:

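import java.sql.Date
import org.apache.spark.sql.types.StructType

// INCOMEAGE and JOINDATE are java.sql.Date so the case class matches the Date columns in empSchema
case class employee(EmployeeID: Int, INCOME: Int, INCOMEAGE: Date, JOINDATE: Date, DEPT: String)

val empSchema = new StructType()
  .add("EmployeeID", "Int")
  .add("INCOME", "Int")
  .add("INCOMEAGE", "Date")
  .add("JOINDATE", "Date")
  .add("DEPT", "String")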

//Reading from the File
import sparkSession.implicits._

val readEmpFile = sparkSession.read
  .option("sep", ",")
  .schema(empSchema)
  .csv(INPUT_DIRECTORY)

//Create employee DataFrame
val custDf = readEmpFile.as[employee]

//Adding Salary Column
val groupByDf = custDf.groupByKey(a => a.EmployeeID)
val k = groupByDf.mapGroups((key,value) => performETL(value))

// Collapses one employee's rows into a single row carrying the computed income.
// calculateIncome is not shown in the question; it is assumed to take List[employee] and return Int.
def performETL(empData: Iterator[employee]): employee = {
  val empList = empData.toList
  // calculateIncome finds the latest income for the account that is less than 10 months old
  val income = calculateIncome(empList)

  // mapGroups only calls this for keys that have rows, so head is safe here
  empList.head.copy(INCOME = income)
}

Is this the right way to implement it? If not, please suggest a better approach to achieve the same result.

The solution must work for both batch processing and Structured Streaming.

0 answers