I am consuming from Kafka and writing Parquet to EMRFS. The following code works in spark-shell:
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._

// Streaming file sink: partitioned Parquet written to EMRFS every 10 seconds.
val filesink_query = outputdf.writeStream
  .partitionBy(<some column>)
  .format("parquet")
  .option("path", <some path in EMRFS>)
  .option("checkpointLocation", "/tmp/ingestcheckpoint")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Append)
  .start()
SBT packages the code without errors. When the .jar is submitted with spark-submit, the job is accepted and stays in a running state indefinitely, but never writes any data to HDFS.
There are no errors in the .inprogress logs.
Some posts suggest that a very long watermark duration can cause this, but I have not set a custom watermark duration.
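For context, the main structural difference between spark-shell and a submitted jar is that a standalone driver has to create its own SparkSession and block on the query explicitly. The sketch below shows one common way to wrap such a query for spark-submit; the object name, broker, topic, output path, and partition column are illustrative placeholders, not values from the original job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._

object IngestApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-parquet").getOrCreate()

    // Illustrative Kafka source; the real job builds outputdf elsewhere.
    val outputdf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
      .option("subscribe", "ingest-topic")                // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS value",
                  "date_format(timestamp, 'yyyy-MM-dd') AS day")

    val filesink_query = outputdf.writeStream
      .partitionBy("day")                                 // placeholder partition column
      .format("parquet")
      .option("path", "s3://my-bucket/ingest/parquet")    // placeholder EMRFS path
      .option("checkpointLocation", "/tmp/ingestcheckpoint")
      .trigger(Trigger.ProcessingTime(10.seconds))
      .outputMode(OutputMode.Append)
      .start()

    // Block the driver for the lifetime of the streaming query; in spark-shell the
    // REPL keeps the session alive, but a submitted jar must do this itself.
    filesink_query.awaitTermination()
  }
}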
Answer 0 (score: 0)
I am able to write Parquet using PySpark, so I'll share my code in case it is useful:
# Kafka source; consumer and session tuning values come from the class configuration.
stream = self.spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", self.kafka_bootstrap_servers) \
    .option("subscribe", self.topic) \
    .option("startingOffsets", self.startingOffsets) \
    .option("max.poll.records", self.max_poll_records) \
    .option("auto.commit.interval.ms", self.auto_commit_interval_ms) \
    .option("session.timeout.ms", self.session_timeout_ms) \
    .option("key.deserializer", self.key_deserializer) \
    .option("value.deserializer", self.value_deserializer) \
    .load()
# Requires: from pyspark.sql.functions import col, date_format
# Decode the Kafka value and derive year/month/day/hour columns for partitioning.
self.query = stream \
    .select(col("value")) \
    .select((self.proto_function("value")).alias("value_udf")) \
    .select(*columns,
            date_format(column_time, "yyyy").alias("year"),
            date_format(column_time, "MM").alias("month"),
            date_format(column_time, "dd").alias("day"),
            date_format(column_time, "HH").alias("hour"))
# Parquet sink partitioned by the derived time columns; note that the
# checkpoint location and the output path both point to self.path here.
query = self.query \
    .writeStream \
    .format("parquet") \
    .option("checkpointLocation", self.path) \
    .partitionBy("year", "month", "day", "hour") \
    .option("path", self.path) \
    .start()
Also, you need to run the code like this: spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 <code>
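The same dependency applies to the Scala job in the question: the Kafka source lives in the spark-sql-kafka-0-10 artifact, which is not on the default Spark classpath, so it has to be supplied with --packages (matching your Scala and Spark versions) or bundled into the assembly jar that is passed to spark-submit.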