How to fix "No Encoder found for scala.Option[String]" in Spark

Asked: 2018-12-19 19:35:03

Tags: apache-spark concurrency

I am launching concurrent Spark jobs from within a single Spark application, and roughly 2% of the time I get an error like this:

Exception in thread "main" java.util.concurrent.ExecutionException: java.lang.UnsupportedOperationException: No Encoder found for scala.Option[String]
- field (class: "scala.Option", name: "my_field")
- root class: "my.package.Clazz"
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at my.package.Application

This originates from org.apache.spark.sql.catalyst.ScalaReflection.
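For what it's worth, deriving the same encoder in isolation succeeds, so the schema itself does not seem to be the problem. A minimal sketch (with Clazz trimmed to the one field, as an illustration):

import org.apache.spark.sql.{Encoder, Encoders}

// Trimmed to one field for illustration; my real class has more fields.
case class Clazz(my_field: Option[String])

// In isolation this derivation works fine:
val enc: Encoder[Clazz] = Encoders.product[Clazz]
println(enc.schema)  // StructType(StructField(my_field,StringType,true))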

According to Spark's documentation, within a single application the scheduler "is fully thread-safe and supports this use case to enable applications that serve multiple requests": https://spark.apache.org/docs/latest/job-scheduling.html

Concurrency is a requirement for my application.
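To illustrate the pattern I am relying on, here is a minimal sketch of the documented multi-threaded usage: independent actions submitted against one SparkSession from a fixed thread pool. The paths, pool size, and hour strings here are placeholders of mine, not from my real application:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("concurrent-jobs").getOrCreate()

// One fixed pool drives several concurrent Spark actions.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

val hours = Seq("2018-12-19T00", "2018-12-19T01", "2018-12-19T02")
val jobs = hours.map { hour =>
  Future {
    // Each task runs its own action; the scheduler interleaves them.
    spark.read.parquet(s"/data/in/$hour")
      .write.mode("errorifexists").parquet(s"/data/out/$hour")
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)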

The specific use case is reading many Cassandra partition keys into a Dataset[T], something the Spark Cassandra connector does not currently handle well. My code looks like this:

var continue = true
while (continue) {
  futures += executor.submit(
    new Runnable {
      override def run(): Unit = {
        // Build a row reader that maps Cassandra rows to the case class T
        val factory = new ClassBasedRowReaderFactory[T]()
        val reader = factory.rowReader(tableDef, caseClassColumns)
        val sqlHour = new sql.Timestamp(calendar.getTime.getTime)

        // Pull one hour's partition synchronously through the Cassandra session
        val rawRows = new ListBuffer[T]()
        session.execute(preparedStatement.bind(sqlHour)).forEach(new Consumer[Row] {
          override def accept(row: Row): Unit = {
            val metadata = CassandraRowMetadata(tableColumns)
            rawRows += reader.read(row, metadata)
          }
        })

        // createDataset is where the encoder for T gets resolved -- and where
        // the intermittent failure surfaces
        val rowsDataset = sparkSession
          .createDataset(rawRows)
          .coalesce(1)

        val destination = ...
        rowsDataset.write
          .mode(SaveMode.ErrorIfExists)
          .parquet(destination)
      }
    }
  )

  calendar.add(Calendar.HOUR_OF_DAY, 1)
  if (...) {
    continue = false
  }
}

futures.foreach(_.get())

And the definition of the particular T I am using is:

case class Clazz(
  my_field: Option[String],
  ...
)

I am running Spark 2.4.0. What worries me is that this works about 98% of the time and fails the other 2% of the time, on the same data. Am I scheduling my concurrent jobs incorrectly, or is this a bug in Spark?
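One variant I can sketch (untested on my side; an assumption, not a confirmed fix) is to resolve the encoder once on the main thread, before any task is submitted, and hand it to createDataset explicitly so the concurrent tasks reuse the already-built instance instead of each triggering ScalaReflection:

import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Assumption/sketch: derive the encoder exactly once, on the driver's main
// thread, before any Runnable is submitted to the executor.
val clazzEncoder: Encoder[Clazz] = Encoders.product[Clazz]

// Inside each Runnable, pass the pre-built encoder explicitly:
def toDataset(sparkSession: SparkSession, rows: Seq[Clazz]): Dataset[Clazz] =
  sparkSession.createDataset(rows)(clazzEncoder)

Whether that actually avoids the 2% failures is exactly what I am unsure about.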

0 Answers:

There are no answers yet.