Below is the input DataFrame (in practice, it is very large).
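For each employee, get the latest income whose Income_age ts is less than 10 months old. If there is no income data within the last 10 months, take the value from that account's sourcedIncome column instead of the Income column. This logic is what the income calculation function below implements.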
Expected output DataFrame
Here is what I plan to implement:
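import java.sql.Date
import org.apache.spark.sql.types.StructType

// Field types must match the schema below: the date columns map to java.sql.Date, not Int
case class employee(EmployeeID: Int, INCOME: Int, INCOMEAGE: Date, JOINDATE: Date, DEPT: String)
val empSchema = new StructType().add("EmployeeID", "Int").add("INCOME", "Int").add("INCOMEAGE", "Date").add("JOINDATE", "Date").add("DEPT", "String")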
//Reading from the File
import sparkSession.implicits._
val readEmpFile = sparkSession.read
.option("sep", ",")
.schema(empSchema)
.csv(INPUT_DIRECTORY)
//Create employee DataFrame
val custDf = readEmpFile.as[employee]
//Adding Salary Column
val groupByDf = custDf.groupByKey(_.EmployeeID)
val k = groupByDf.mapGroups((key, value) => performETL(value))
def performETL(empData: Iterator[employee]): employee = {
  val empList = empData.toList
  // calculateIncome (not shown) finds the latest income for the account that is
  // less than 10 months old and returns it
  val income = calculateIncome(empList)
  // mapGroups never passes an empty group, so head is safe here.
  // Return one row per employee: the first row with INCOME replaced by the computed value.
  empList.head.copy(INCOME = income)
}
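The body of calculateIncome is not shown above. The rule it is meant to apply can be sketched in plain Scala, without Spark. Everything here is an assumption made for illustration: the `IncomeRecord` type, its field names, the `asOf` and `maxAgeMonths` parameters, and the choice to fall back to the sourcedIncome of the most recent row when no income is recent enough.

```scala
import java.time.LocalDate

// Hypothetical record type: field names and types are assumptions, not the real schema.
final case class IncomeRecord(income: Int, incomeDate: LocalDate, sourcedIncome: Int)

object IncomeRule {
  // Returns the income of the most recent record dated after (asOf - maxAgeMonths).
  // If no record is that recent, falls back to the sourcedIncome of the most
  // recent record overall (assumed tie-breaking rule).
  def calculateIncome(records: List[IncomeRecord],
                      asOf: LocalDate,
                      maxAgeMonths: Int = 10): Int = {
    require(records.nonEmpty, "an employee group must contain at least one record")
    val cutoff = asOf.minusMonths(maxAgeMonths.toLong)
    val recent = records.filter(_.incomeDate.isAfter(cutoff))
    // Compare on epoch days so no implicit Ordering[LocalDate] is needed.
    if (recent.nonEmpty) recent.maxBy(_.incomeDate.toEpochDay).income
    else records.maxBy(_.incomeDate.toEpochDay).sourcedIncome
  }
}
```

Inside performETL, the rows of one employee would first be mapped to this record shape and then passed to `IncomeRule.calculateIncome` together with the reference date.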
Is this the right way to implement it? If not, please suggest a better approach to achieve the same result.
The solution must work for both batch and Structured Streaming.