Spark Structured Streaming: replacing a column's value

Date: 2018-04-26 19:21:15

Tags: apache-spark spark-structured-streaming

I have the following dataframe:

val tDataJsonDF = kafkaStreamingDFParquet
   .filter($"value".contains("tUse"))
   .filter($"value".isNotNull)
   .selectExpr("cast (value as string) as tdatajson", "cast (topic as string) as env")
   .select(from_json($"tdatajson", schema = ParquetSchema.tSchema).as("data"), $"env".as("env"))
   .select("data.*", "env")
   .select($"date",           // format: YYYY/MM/dd
           $"time",
           $"event",
           $"serviceGroupId",
           $"userId",
           $"env")

The date column of this streaming dataframe has the format YYYY/MM/dd.

Because of this, when I use the column as a partition column for the Parquet sink, Spark creates partition directories such as date=2018%04%12.
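The percent signs appear because characters that are illegal in a filesystem path component, such as `/`, get percent-escaped when Spark builds partition directory names. This is not Spark's exact escaping routine, but a minimal plain-Scala sketch of the idea using `java.net.URLEncoder`:

```scala
import java.net.URLEncoder

object PartitionEscapeDemo extends App {
  // A '/' cannot appear inside a single directory name, so a partition
  // value like "2018/04/12" must be escaped before it becomes part of a
  // path like date=<escaped-value>.
  val raw     = "2018/04/12"
  val escaped = URLEncoder.encode(raw, "UTF-8")
  println(escaped) // 2018%2F04%2F12
}
```

Reformatting the date to use `-` (or no separator at all) avoids the escaping entirely, which is what the question below asks for.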

Can I modify the column value on the fly in the code above, so that the date value is in YYYY-MM-dd or YYYYMMdd format?

The Parquet write query:

val tunerQuery = tunerDataJsonDF
  .writeStream
  .format("parquet")
  .option("path",pathtodata )
  .option("checkpointLocation", pathtochkpt)
  .partitionBy("date","env","serviceGroupId")
  .start()

1 Answer:

Answer 0 (score: 0)

I assume you are using Spark 2.2+. You can add a properly formatted copy of the column with withColumn:

import org.apache.spark.sql.functions.{col, date_format, to_date}

// Note the lowercase "yyyy": uppercase "YYYY" means week-based year in
// Java date patterns and can parse to the wrong year near year boundaries.
tDataJsonDF.withColumn("formatted_date",
  date_format(to_date(col("date"), "yyyy/MM/dd"), "yyyy-MM-dd"))
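The conversion that Spark expression performs can be sketched in plain Scala with `java.time`, without a Spark session; this also shows why the lowercase `yyyy` pattern matters:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DateReformatDemo extends App {
  // Parse the slash-separated value and re-render it with dashes.
  // Lowercase "yyyy" is year-of-era; uppercase "YYYY" is week-based year
  // and would give surprising results around the new year.
  val in  = DateTimeFormatter.ofPattern("yyyy/MM/dd")
  val out = DateTimeFormatter.ofPattern("yyyy-MM-dd")

  val reformatted = LocalDate.parse("2018/04/12", in).format(out)
  println(reformatted) // 2018-04-12
}
```

In the streaming query you would then partition by the new column, e.g. `.partitionBy("formatted_date", "env", "serviceGroupId")`, and optionally drop the original `date` column so it is not written twice.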