I am trying to build an application with the spark-jobserver API (targeting Spark 2.2.0), but I found that namedObjects do not seem to be supported when using SparkSession. My code looks like this:
import com.typesafe.config.Config
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.scalactic._
import spark.jobserver.{DataFramePersister, NamedDataFrame, NamedObjectSupport, SparkSessionJob}
import spark.jobserver.api.{JobEnvironment, SingleProblem, ValidationProblem}

import scala.util.Try

object word1 extends SparkSessionJob with NamedObjectSupport {
  type JobData = Seq[String]
  type JobOutput = String

  // namedObjects.update() resolves an implicit persister for NamedDataFrame
  implicit def dataFramePersister = new DataFramePersister

  def runJob(sparkSession: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput = {
    import sparkSession.implicits._
    // NamedDataFrame wraps a DataFrame, so build one from the input words
    val df = data.toDF("word")
    val ndf = NamedDataFrame(df, true, StorageLevel.MEMORY_ONLY)
    this.namedObjects.update("df1", ndf)
    this.namedObjects.getNames().toString
  }

  def validate(sparkSession: SparkSession, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] = {
    Try(config.getString("input.string").split(" ").toSeq)
      .map(words => Good(words))
      .getOrElse(Bad(One(SingleProblem("No input.string param"))))
  }
}
But there is an error on the this.namedObjects.update() line; it looks like namedObjects are not supported here. Yet the same code compiles when I use SparkJob instead:
object word1 extends SparkJob with NamedObjectSupport
Are namedObjects supported with SparkSession? If not, is there any way to persist a DataFrame/Dataset across jobs?
Answer 0 (score: 0):
I figured it out; it was a silly mistake on my side. See https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server-api/src/main/scala/spark/jobserver/NamedObjectSupport.scala#L138, which says:
// NamedObjectSupport is no longer necessary, due to the JobEnvironment in api.SparkJobBase. It is
// also automatically imported for the old spark.jobserver.SparkJobBase for compatibility.
@Deprecated
trait NamedObjectSupport
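In other words, named objects are now reached through the JobEnvironment argument that spark-jobserver passes into every runJob and validate call. Roughly paraphrased from the api package (see the linked repository for the exact definition), that environment exposes:

trait JobEnvironment {
  def jobId: String              // id of the currently running job
  def namedObjects: NamedObjects // shared named-object store of the context
  def contextConfig: Config      // configuration of the hosting context
}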
So to access this functionality, the code needs to change to:
import com.typesafe.config.Config
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.scalactic._
import spark.jobserver.{DataFramePersister, NamedDataFrame, NamedObjectSupport, SparkSessionJob}
import spark.jobserver.api.{JobEnvironment, SingleProblem, ValidationProblem}

import scala.util.Try

// NamedObjectSupport is kept only for compatibility; the store itself
// is reached through the JobEnvironment (runtime) parameter
object word1 extends SparkSessionJob with NamedObjectSupport {
  type JobData = Seq[String]
  type JobOutput = String

  // namedObjects.update() resolves an implicit persister for NamedDataFrame
  implicit def dataFramePersister = new DataFramePersister

  def runJob(sparkSession: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput = {
    import sparkSession.implicits._
    val df = data.toDF("word")
    val ndf = NamedDataFrame(df, true, StorageLevel.MEMORY_ONLY)
    // use the JobEnvironment's namedObjects instead of this.namedObjects
    runtime.namedObjects.update("df1", ndf)
    runtime.namedObjects.getNames().toString
  }

  def validate(sparkSession: SparkSession, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] = {
    Try(config.getString("input.string").split(" ").toSeq)
      .map(words => Good(words))
      .getOrElse(Bad(One(SingleProblem("No input.string param"))))
  }
}
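This also answers the persistence part of the question: a named object lives inside a long-running context, so any later job submitted to the same context can read the DataFrame back through its own JobEnvironment. Below is a minimal sketch; word2 and the way it consumes the DataFrame are my own illustration, and it assumes the timeout/persister implicits resolve the same way they do for update():

import com.typesafe.config.Config
import org.apache.spark.sql.SparkSession
import org.scalactic._
import spark.jobserver.{NamedDataFrame, SparkSessionJob}
import spark.jobserver.api.{JobEnvironment, ValidationProblem}

object word2 extends SparkSessionJob {
  type JobData = Unit
  type JobOutput = Long

  def runJob(sparkSession: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput = {
    // look up the DataFrame that word1 stored under the name "df1"
    runtime.namedObjects.get[NamedDataFrame]("df1") match {
      case Some(NamedDataFrame(df, _, _)) => df.count()
      case None => throw new NoSuchElementException("named object df1 not found")
    }
  }

  def validate(sparkSession: SparkSession, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] = Good(())
}

Note that named objects only survive within one context, so both jobs have to be submitted with the context query parameter pointing at a context created beforehand through the /contexts endpoint, rather than run in ad-hoc per-job contexts.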