Question

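I can load data from a database and then do some processing with it. The problem is that some tables have the date column as 'string', while other tables have it as 'timestamp'.

I have no way of knowing the type of the date column before loading the data.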

> x.getAs[String]("date") // fails when the date column is of timestamp type
> x.getAs[Timestamp]("date") // fails when the date column is of string type

This is how I load the data with Spark.

spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", table)
  .option("user", user)
  .option("password", password)
  .load()

Is there any way to handle both types together? Or to convert the column to a string?

Answer 1

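You can pattern-match on the column's type (using the DataFrame's schema) to decide whether to parse the String into a Timestamp or just use the Timestamp as is, and use the unix_timestamp function to do the actual conversion:

import java.sql.Timestamp
import java.text.SimpleDateFormat

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import spark.implicits._ // assumes a SparkSession named `spark`; needed for toDF and $"..."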

// preparing some example data - df1 with String type and df2 with Timestamp type
val df1 = Seq(("a", "2016-02-01"), ("b", "2016-02-02")).toDF("key", "date")
val df2 = Seq(
  ("a", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-01").getTime)),
  ("b", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-02").getTime))
).toDF("key", "date")

// If column is String, converts it to Timestamp
def normalizeDate(df: DataFrame): DataFrame = {
  df.schema("date").dataType match {
    case StringType => df.withColumn("date", unix_timestamp($"date", "yyyy-MM-dd").cast("timestamp"))
    case _ => df
  }
}

// after "normalizing", you can assume date has Timestamp type - 
// both would print the same thing:
normalizeDate(df1).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
normalizeDate(df2).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
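
The question also asks about converting everything to a string instead. A minimal sketch of that direction, assuming the string dates use the yyyy-MM-dd format from the sample data above: cast the column to timestamp first (which leaves a timestamp column unchanged) and then format it back with date_format, so no schema check is needed. The helper name dateAsString and the local SparkSession are only for illustration.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_format}

// Local session for illustration only; reuse your existing `spark` in a real job.
val spark = SparkSession.builder().master("local[1]").appName("date-as-string").getOrCreate()
import spark.implicits._

// Hypothetical helper: cast to timestamp (no change for timestamp columns),
// then format as a yyyy-MM-dd string so both kinds of input end up alike.
def dateAsString(df: DataFrame): DataFrame =
  df.withColumn("date", date_format(col("date").cast("timestamp"), "yyyy-MM-dd"))

val fromString = Seq(("a", "2016-02-01")).toDF("key", "date")
val fromTimestamp = Seq(("a", Timestamp.valueOf("2016-02-01 00:00:00"))).toDF("key", "date")

// After the conversion, getAs[String] is safe for both DataFrames.
val s1 = dateAsString(fromString).collect().map(_.getAs[String]("date"))
val s2 = dateAsString(fromTimestamp).collect().map(_.getAs[String]("date"))
```

Note that a plain cast("string") alone would not be enough here, since a timestamp column would then carry a time part while the original string column would not.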

Answer 2

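You can try the following:

(1) If your version supports the inferSchema option, start using it when loading. It makes Spark infer the data type of each column, although it does not work in every case. Also take a look at the input data: if it contains quotes, I suggest adding an extra option during the load to account for them.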

val inputDF = spark.read.format("csv").option("header","true").option("inferSchema","true").load(fileLocation)

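(2) To identify the data types of the columns, you can use the code below, which puts all column names and all data types into their own String arrays.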

val columnNames : Array[String] = inputDF.columns
val columnDataTypes : Array[String] = inputDF.schema.fields.map(x=>x.dataType).map(x=>x.toString)
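
If only one column matters, the same schema information can be looked up by name rather than by array position. A small sketch, using a hand-built StructType as a stand-in for inputDF.schema (the column names here are assumptions taken from the question):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

// Hypothetical schema standing in for inputDF.schema
val schema = StructType(Seq(
  StructField("key", StringType),
  StructField("date", TimestampType)
))

// Column name -> short type name, e.g. "string" or "timestamp"
val typeByColumn: Map[String, String] =
  schema.fields.map(f => f.name -> f.dataType.typeName).toMap

// Direct check on a single column, comparing DataType objects instead of strings
val dateIsString: Boolean = schema("date").dataType == StringType
```

Comparing against StringType directly avoids depending on the exact text that toString produces.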

Answer 3

Row has a simple way around this: get(i: Int): Any. It automatically maps the Spark SQL type to the corresponding return type. For example:

val fieldIndex = row.fieldIndex("date")
val date = row.get(fieldIndex)
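
Since get returns Any, the caller still has to branch on the runtime type of the value. A minimal sketch of that step in plain Scala, assuming string dates are in yyyy-MM-dd format; toTimestamp is a hypothetical helper name, not part of the Spark API:

```scala
import java.sql.Timestamp
import java.time.LocalDate
import scala.util.Try

// Hypothetical helper: normalizes the untyped value returned by row.get(...).
// Assumes string dates look like "2016-02-01" (yyyy-MM-dd).
def toTimestamp(value: Any): Option[Timestamp] = value match {
  case ts: Timestamp => Some(ts) // already a timestamp, use as is
  case s: String =>
    // Parse the date and take midnight of that day; None if it does not parse
    Try(Timestamp.valueOf(LocalDate.parse(s.trim).atStartOfDay())).toOption
  case _ => None // null or an unexpected type
}
```

With this in place, something like toTimestamp(row.get(fieldIndex)) gives an Option[Timestamp] regardless of which column type the table had.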

Answer 4

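// Same schema-matching idea for a "location" column that is either a plain String
// or a struct with location/country/region/zone fields.
// Assumes DataFrame, StringType, functions.lit and spark.implicits._ are imported.
def parseLocationColumn(df: DataFrame): DataFrame = {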
  df.schema("location").dataType match {
    case StringType => df.withColumn("locationTemp", $"location")
      .withColumn("countryTemp", lit("Unknown"))
      .withColumn("regionTemp", lit("Unknown"))
      .withColumn("zoneTemp", lit("Unknown"))
    case _ => df.withColumn("locationTemp", $"location.location")
      .withColumn("countryTemp", $"location.country")
      .withColumn("regionTemp", $"location.region")
      .withColumn("zoneTemp", $"location.zone")
  }
}

Spark, Scala - Determining a column's type

4 answers: