I am launching concurrent Spark jobs from within a single Spark application, and roughly 2% of the time I get an error like the following:
Exception in thread "main" java.util.concurrent.ExecutionException: java.lang.UnsupportedOperationException: No Encoder found for scala.Option[String]
- field (class: "scala.Option", name: "my_field")
- root class: "my.package.Clazz"
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at my.package.Application
The exception originates from org.apache.spark.sql.catalyst.ScalaReflection.
According to Spark's documentation, within a single application Spark "is fully thread-safe and supports this use case to enable applications that serve multiple requests": https://spark.apache.org/docs/latest/job-scheduling.html
Concurrency is a requirement for my application.
The specific use case involves reading multiple Cassandra partition keys into a Dataset[T], something the Spark Cassandra connector does not currently handle well. My code looks like this:
var continue = true
while (continue) {
  futures += executor.submit(new Runnable {
    override def run(): Unit = {
      val factory = new ClassBasedRowReaderFactory[T]()
      val reader = factory.rowReader(tableDef, caseClassColumns)
      val sqlHour = new sql.Timestamp(calendar.getTime.getTime)
      val rawRows = new ListBuffer[T]()
      session.execute(preparedStatement.bind(sqlHour)).forEach(new Consumer[Row] {
        override def accept(row: Row): Unit = {
          val metadata = CassandraRowMetadata(tableColumns)
          rawRows += reader.read(row, metadata)
        }
      })
      val rowsDataset = sparkSession
        .createDataset(rawRows)
        .coalesce(1)
      val destination = ...
      rowsDataset.write
        .mode(SaveMode.ErrorIfExists)
        .parquet(destination)
    }
  })
  calendar.add(Calendar.HOUR_OF_DAY, 1)
  if (...) {
    continue = false
  }
}
futures.foreach(_.get())
And the definition of the specific T I am using looks like this:
case class Clazz(
my_field: Option[String],
...
)
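For what it's worth, the encoder for this class is derived implicitly (via reflection) each time createDataset is reached inside a task. A minimal sketch of what I understand the derivation to be, done eagerly on a single thread before any tasks are submitted (the class name and the write helper here are simplified stand-ins for my real code, and I have not confirmed this avoids the error):

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Simplified stand-in for my real case class.
case class Clazz(my_field: Option[String])

object EncoderOnce {
  // Encoders.product performs the same reflection-based derivation that
  // createDataset would otherwise trigger implicitly; building it here
  // forces it once, on the driver thread, rather than concurrently
  // inside each Runnable.
  val clazzEncoder: Encoder[Clazz] = Encoders.product[Clazz]

  def write(spark: SparkSession, rows: Seq[Clazz], destination: String): Unit =
    spark.createDataset(rows)(clazzEncoder) // reuse the pre-built encoder
      .coalesce(1)
      .write
      .parquet(destination)
}
```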
I am running Spark 2.4.0. What worries me is that this works about 98% of the time and fails the remaining 2%, on the same data. Am I scheduling the concurrent jobs incorrectly, or is this a bug in Spark?