We recently started using Structured Streaming on Azure Databricks. We are currently consuming events from Event Hubs and writing them to Azure Data Lake Store as Parquet.
I am able to write the stream to the console, but when we try to write it to any physical storage (Blob / Azure Data Lake), we get the error "java.util.NoSuchElementException: key not found:"
val schema = new StructType()
.add("col1",StringType, nullable = true)
.add("col2", StringType, nullable = true)
.add("col3", StringType, nullable = true)
.add("col4",StringType, nullable = true)
val messages = incomingStream.selectExpr("offset","partitionKey","cast (body as string) AS Content")
val structuredMsg = messages.select($"offset",$"partitionKey",from_json(col("Content"),schema).alias("data"))
val results = structuredMsg.select(
  $"offset",
  $"partitionKey",
  current_date().as("date_1"),
  $"data.col1".as("col1"),
  $"data.col2".as("col2"),
  $"data.col3".as("col3"),
  $"data.col4".as("col4"))
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
results
  .withColumn("date", $"date_1")
  .writeStream
  .format("text") // write as Parquet partitioned by date
  .partitionBy("date")
  .option("path", "dbfs:/mnt/datalake/XXX-databricks-mount/XXX-databricks/test")
  .option("checkpointLocation", "dbfs:/checkpoint_path/")
  .trigger(Trigger.ProcessingTime(60.seconds))
  .outputMode(OutputMode.Append)
  .start()
java.util.NoSuchElementException: key not found: {"ehName":"test1","partitionId":1}
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at org.apache.spark.sql.eventhubs.EventHubsSource$$anonfun$getBatch$2.apply(EventHubsSource.scala:233)
at org.apache.spark.sql.eventhubs.EventHubsSource$$anonfun$getBatch$2.apply(EventHubsSource.scala:231)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.eventhubs.EventHubsSource.getBatch(EventHubsSource.scala:231)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:394)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
Answer 0 (score: 0)
Specifying the SparkSession checkpoint path separately from the writeStream statement made the code work fine:
spark.conf.set("spark.sql.streaming.checkpointLocation", "dbfs:/checkpoint_path/");
results.writeStream
  .outputMode("append")
  .format("csv")
  .option("path", "dbfs:/mnt/datalake/XXX-databricks-mount/XXX-databricks/test")
  .start()
  .awaitTermination()
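A likely reason this helps is that spark.sql.streaming.checkpointLocation acts as a session-wide default root rather than a single shared checkpoint: Spark derives a separate subdirectory under it for each query (based on the query name or a generated id), so concurrent queries no longer clash over the same checkpoint state.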
Answer 1 (score: 0)
You may be using the same checkpoint location in more than one streaming job. One stream starts writing to it while another tries to read it back and interpret the entries, which causes the error.
I hit the same problem reading from two Event Hubs while using the same checkpoint location, which made my second job try to read from a topic/partition combination that didn't exist.
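As an illustration, here is a minimal sketch of the fix under that assumption: give each streaming query its own checkpoint directory. The Event Hubs option maps (ehConf1, ehConf2) and the output/checkpoint paths below are placeholders, not part of the original post.
// Hypothetical example: two independent queries, each with a distinct checkpoint path.
// ehConf1 / ehConf2 are assumed to be Map[String, String] values built from the
// Event Hubs connector configuration for the two hubs.
val stream1 = spark.readStream.format("eventhubs").options(ehConf1).load()
val stream2 = spark.readStream.format("eventhubs").options(ehConf2).load()

stream1.writeStream
  .format("parquet")
  .option("path", "dbfs:/mnt/datalake/output/hub1")
  .option("checkpointLocation", "dbfs:/checkpoints/hub1") // unique to this query
  .start()

stream2.writeStream
  .format("parquet")
  .option("path", "dbfs:/mnt/datalake/output/hub2")
  .option("checkpointLocation", "dbfs:/checkpoints/hub2") // never reuse hub1's checkpoint
  .start()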