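When I try to write Spark Structured Streaming data to CSV, I see empty part files generated at the HDFS location. I tried writing the same query to the console instead, and there the data does appear.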
import java.util.Calendar
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("micro").
  enableHiveSupport().
  config("hive.exec.dynamic.partition", "true").
  config("hive.exec.dynamic.partition.mode", "nonstrict").
  config("spark.sql.streaming.checkpointLocation", "/user/sasidhr1/sparkCheckpoint").
  config("spark.debug.maxToStringFields", 100).
  getOrCreate()
import spark.implicits._ // needed for the $"column" syntax

val mySchema = StructType(Array(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("year", IntegerType),
  StructField("rating", DoubleType),
  StructField("duration", IntegerType)
))
// Stream CSV files from a local directory
val xmlData = spark.readStream.option("sep", ",").schema(mySchema).csv("file:///home/sa1/kafdata/")
// Stamp every row with the current time and use it as the event time
val df_agg_without_time = xmlData.withColumn("event_time",
  to_utc_timestamp(current_timestamp(), Calendar.getInstance().getTimeZone().getID()))
// 10-second windows sliding every 5 seconds, grouped by year
val df_agg_with_time = df_agg_without_time.withWatermark("event_time", "10 seconds").
  groupBy(window($"event_time", "10 seconds", "5 seconds"), $"year").
  agg(sum($"rating").as("rating"), sum($"duration").as("duration"))
val pr = df_agg_with_time.drop("window")
pr.writeStream.outputMode("append").format("csv").
  option("path", "hdfs://ccc/apps/hive/warehouse/rta.db/sample_movcsv/").start()
If I do not drop the window column, a different problem occurs, which I have already posted here: How to write windowed aggregation in CSV format?
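For context, the window column is a struct with start and end timestamps, and the CSV data source cannot store struct columns, so the variant that keeps the window would presumably need it flattened first. Below is a minimal, untested sketch of that idea, assuming the aggregated DataFrame has the columns window, year, rating and duration as in my code. The names WindowFlattening, flattenWindow, window_start and window_end are my own placeholders, not anything from Spark.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object WindowFlattening {
  // Replace the struct column "window" with two plain timestamp columns
  // so that every remaining column has a type the CSV writer accepts.
  // Assumes the input has columns: window, year, rating, duration.
  def flattenWindow(aggregated: DataFrame): DataFrame =
    aggregated.select(
      col("window.start").as("window_start"), // hypothetical output name
      col("window.end").as("window_end"),     // hypothetical output name
      col("year"),
      col("rating"),
      col("duration")
    )
}
```

With this, `WindowFlattening.flattenWindow(df_agg_with_time)` would take the place of `df_agg_with_time.drop("window")` before `writeStream`. It only addresses the column type; I do not know whether it has any bearing on the empty part files.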
Can anyone help? How can I write the CSV files to HDFS after the aggregation?