Spark UDF with DataFrame

Date: 2016-04-21 21:57:25

Tags: apache-spark apache-spark-sql

I am using Spark 1.3. I have a dataset where the dates in a column (the ordering_date column) are in yyyy/MM/dd format. I want to do some calculations with dates, so I want to use Joda-Time for the conversions/formatting. Here is the UDF that I have:

 val return_date = udf((str: String, dtf: DateTimeFormatter) => dtf.formatted(str))

Here is the code where the UDF is being called. However, I get an error saying "Not Applicable". Do I need to register this UDF, or am I missing something here?

val user_with_dates_formatted = users.withColumn(
  "formatted_date",
  return_date(users("ordering_date"), DateTimeFormat.forPattern("yyyy/MM/dd"))
)

1 Answer:

Answer 0 (score: 2)

I don't believe you can pass a DateTimeFormatter as an argument to the UDF; you can only pass in Columns. One solution would be to pass the format pattern in as a string column and build the formatter inside the UDF:

import org.apache.spark.sql.functions.{lit, udf}
import org.joda.time.format.DateTimeFormat

val return_date = udf((str: String, format: String) =>
  // parse the string with the given pattern; toString yields an ISO-8601 string
  DateTimeFormat.forPattern(format).parseDateTime(str).toString)

And then:

val user_with_dates_formatted = users.withColumn(
  "formatted_date",
  return_date(users("ordering_date"), lit("yyyy/MM/dd"))
)

Honestly, though -- both this and your original approach have the same problem: they call forPattern to build the formatter for every record. Better would be to create a singleton object that wraps a Map[String, DateTimeFormatter], maybe like this (thoroughly untested, but you get the idea):

import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}

object DateFormatters {
  // cache of pattern -> formatter, populated lazily on first use
  var formatters = Map[String, DateTimeFormatter]()

  def getFormatter(format: String): DateTimeFormatter = {
    if (!formatters.contains(format)) {
      formatters = formatters + (format -> DateTimeFormat.forPattern(format))
    }
    formatters(format)
  }
}

Then you would change your UDF to:

val return_date = udf((str: String, format: String) =>
  DateFormatters.getFormatter(format).parseDateTime(str).toString)

That way, DateTimeFormat.forPattern(...) is only called once per format per executor.
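
As a quick sanity check (a hypothetical driver-side snippet, not part of the original answer), calling getFormatter twice with the same pattern returns the same cached instance:

val f1 = DateFormatters.getFormatter("yyyy/MM/dd")
val f2 = DateFormatters.getFormatter("yyyy/MM/dd")
assert(f1 eq f2) // same instance, so forPattern only ran once for this pattern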

One thing to note about the singleton object solution is that you can't define the object in the spark-shell -- you have to package it in a JAR file and pass it with the --jars option to spark-shell if you want to use the DateFormatters object in the shell.
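
For example, assuming the object is compiled into a JAR named date-formatters.jar (a hypothetical name):

spark-shell --jars date-formatters.jar

Then import DateFormatters (under whatever package you gave it) and use it from the shell as usual.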