Timestamp changed when writing with Parquet

Date: 2020-08-31 20:05:03

Tags: scala apache-spark google-bigquery airflow parquet

I have a Spark application that loads a CSV file, converts it to Parquet, stores the Parquet file in Data Lake storage, and then loads the data into a BigQuery table.

The problem is that when the CSV contains a timestamp value that is too old, the conversion happens, but the timestamp column cannot be loaded into the BigQuery table.

When I set the configuration spark.sql.parquet.outputTimestampType to TIMESTAMP_MICROS, I get this error on BigQuery:

Cannot return an invalid timestamp value of -62135607600000000 microseconds relative to the Unix epoch. The range of valid timestamp values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]; error in writing field reference_date

When I set the configuration spark.sql.parquet.outputTimestampType to TIMESTAMP_MILLIS, I get this error on Airflow:

Error while reading data, error message: Invalid timestamp value -62135607600000 for field 'reference_date' of type 'INT64' (logical type 'TIMESTAMP_MILLIS'): generic::out_of_range: Invalid timestamp value: -62135607600000
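
As a debugging aid, here is a minimal sketch, not part of the original job, that reads the written Parquet back with Spark and prints the raw epoch values, to see whether the shift already exists in the Parquet output or only appears during the BigQuery load. The session wiring and the final show call are my assumptions; the read path is the same path used in the write step shown further below:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical diagnostic: inspect the values Spark actually wrote to Parquet,
// before BigQuery ever sees them.
val spark = SparkSession.builder().appName("Inspect Parquet timestamps").getOrCreate()

spark.read.parquet("path/to/save")
  .select(
    col("reference_date"),
    col("reference_date").cast("long").as("epoch_seconds") // seconds since 1970-01-01 UTC
  )
  .show(truncate = false)
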
  • CSV file:
id,reference_date
"6829baef-bcd9-412a-a2f3-abdfed02jsd","0001-01-02 21:00:00"
  • Reading the CSV file (and casting the reference_date column to TimestampType):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, TimestampType}

// Cast a single column of a DataFrame to the given data type.
def castDFColumn(
  df: DataFrame,
  column: String,
  dataType: DataType
): DataFrame = df.withColumn(column, df(column).cast(dataType))

...
var df = spark
  .read
  .format("csv")
  .option("header", true)
  .load("myfile.csv")

df = castDFColumn(df, "reference_date", TimestampType)
  • Writing to a Parquet file:
df
  .write
  .mode("overwrite")
  .parquet("path/to/save")
  • Spark application runtime configuration:
val conf = new SparkConf().setAppName("Load CSV")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS/TIMESTAMP_MICROS")
conf.set("spark.sql.session.timeZone", "UTC")

It looks like the timestamp is being changed to 0000-12-31 21:00:00, or something like that, which is outside the acceptable range for an INT64 timestamp.
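
As a quick check on that guess (this decoding is mine, assuming the error value is plain microseconds from the Unix epoch in the ISO/proleptic Gregorian calendar that java.time uses):

import java.time.Instant

// The microsecond value from the first BigQuery error.
val errorMicros = -62135607600000000L

// Convert microseconds to (seconds, nanosecond adjustment) and decode.
val instant = Instant.ofEpochSecond(errorMicros / 1000000L, (errorMicros % 1000000L) * 1000L)
println(instant) // 0000-12-31T21:00:00Z, i.e. 3 hours before 0001-01-01 00:00:00 UTC

That is two days earlier than the 0001-01-02 21:00:00 value in the CSV, which matches the shift described above.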

Has anyone experienced this?

0 Answers:

No answers yet.