Spark Scala - difference between the current date and max(day_id)

Asked: 2018-05-07 15:22:30

Tags: scala apache-spark dataframe date-difference

I need to calculate the difference between two dates. The problem is:

Currentdate - max(day_id)

"Currentdate" is in the simple date format yyyyMMdd.

"day_id" is a string column whose values are in yyyy-mm-dd format.

I have a dataframe that converts the date (a string) into date format (yyyy-mm-dd):

df1 = to_date(from_unixtime(unix_timestamp(day_id, 'yyyy-MM-dd')))

Normally, to find max(day_id), I would do:

def daySince(columnName: String): Column = { max(col(columnName)) }
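For reference, a minimal compilable sketch of that helper with its imports filled in (the DataFrame name `df` in the usage comment is hypothetical):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, max}

// Returns an aggregate expression for the maximum value of the given column.
def daySince(columnName: String): Column = max(col(columnName))

// Used inside an aggregation, e.g. df.agg(daySince("day_id"))
```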

I can't figure out how to compute the difference:

Currentdate - max(day_id)

1 answer:

Answer 0 (score: 2)

Given an input dataframe and schema such as:

+---+----------+
|id |day_id    |
+---+----------+
|id1|2017-11-21|
|id1|2018-01-21|
|id2|2017-12-21|
+---+----------+

root
 |-- id: string (nullable = true)
 |-- day_id: string (nullable = true)

you can use the current_date() built-in function to meet your requirement, combined with

datediff()

which should give you:

import org.apache.spark.sql.functions._
df.withColumn("diff", datediff(current_date(), to_date(col("day_id"), "yyyy-MM-dd")))
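The line above computes a per-row difference; the question actually asks for Currentdate - max(day_id). A hedged sketch of that combination (assuming a SparkSession is already in scope and `df` is the DataFrame with the string column `day_id` shown above — neither is spelled out in the original answer):

```scala
import org.apache.spark.sql.functions._

// Sketch: aggregate max(day_id) as a proper date first,
// then subtract it from current_date() with datediff().
val diffDF = df
  .agg(max(to_date(col("day_id"), "yyyy-MM-dd")).as("max_day_id"))
  .select(datediff(current_date(), col("max_day_id")).as("days_since_max"))

diffDF.show(false)
```

With the sample data above, `max_day_id` would be 2018-01-21, so `days_since_max` is the number of days between that date and the day the job runs.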