I have a Spark application. My use case is to allow users to define an arbitrary function of the form Record => Record as a "rule", which is then applied to every record of an RDD/Dataset.
Here is the code:
//Sample rows with Id, Name, DOB and address
val row1 = "19283,Alan,1989-01-20,445 Mount Eden Road Mount Eden Auckland"
val row2 = "15689,Ben,1989-01-20,445 Mount Eden Road Mount Eden Auckland"
val record1 = new Record(
new RecordMetadata(),
row1,
true
)
val record2 = new Record(
new RecordMetadata(),
row2,
true
)
val inputRecsList = List(record1, record2)
val inputRecs = spark.sparkContext.parallelize(inputRecsList)
val rule = ScalaExpression(
  //Sample rule: a lambda (Record => Record), supplied as a string
  """
    | import model.Record
    | { record: Record => record }
  """.stripMargin
)
val outputRecs = inputRecs.map(rule.transformation)
Here are the definitions of the Record, RecordMetadata and ScalaExpression classes:
case class Record(
val metadata: RecordMetadata,
val row: String,
val isValidRecord: Boolean = true
) extends Serializable
case class RecordMetadata() extends Serializable
case class ScalaExpression(function: Function1[Record, Record]) extends Rule {
  def transformation = function
}

object ScalaExpression {
  import scala.reflect.runtime.currentMirror
  import scala.tools.reflect.ToolBox

  /**
   * @param string Scala expression as a string
   * @return Evaluated result of type Function1[Record, Record] (i.e. Record => Record)
   */
  def apply(string: String) = {
    // Compile and evaluate the user-supplied expression at runtime
    val toolbox = currentMirror.mkToolBox()
    val tree = toolbox.parse(string)
    val fn = toolbox.eval(tree).asInstanceOf[Record => Record]
    new ScalaExpression(fn)
  }
}
The code above throws a cryptic exception:
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2287)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1417)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2293)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
However, the code works fine if the rule is defined directly in the code. The code also works fine if the map (with the rule evaluated at runtime) is applied to a List instead of an RDD/Dataset.
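For illustration only, the two working variants might look roughly like this (a sketch based on the description above; the names directRule, outputRecsDirect and outputList are illustrative and not from the original post):

// Variant 1 (works): the rule is constructed directly in code, no runtime compilation involved
val directRule = new ScalaExpression((record: Record) => record)
val outputRecsDirect = inputRecs.map(directRule.transformation)

// Variant 2 (works): the toolbox-compiled rule is applied to a plain Scala List instead of the RDD
val outputList = inputRecsList.map(rule.transformation)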
I have been stuck for a while trying to get this to work. Any help would be greatly appreciated.
EDIT: The question marked as a "possible duplicate" of this one addresses an entirely different problem. My use case tries to obtain a rule (a valid Scala expression that transforms one record into another) from the user at runtime, and it runs into serialization issues when that rule is applied to each record of the Dataset.
Best regards.
Answer 0: (score: 1)
There is a pending open issue on the Spark JIRA for this - SPARK-20525. The problem occurs because the Spark classloader does not match the classloader used when loading the Spark UDF.
The solution is to load your Spark session after the interpreter. Please find the sample code below. You can also refer to my GitHub, for example SparkCustomTransformations.
import org.apache.spark.sql.{DataFrame, SparkSession}

trait CustomTransformations extends Serializable {
  def execute(spark: SparkSession, df: DataFrame, udfFunctions: AnyRef*): DataFrame
}

// IMPORTANT: the Spark session should be lazily evaluated, so that it is only
// created after the interpreter (and its classloader) has been set up
lazy val spark = getSparkSession

// Builds a Scala interpreter (REPL) instance that will compile the user-supplied code
def getInterpretor: scala.tools.nsc.interpreter.IMain = {
  import scala.tools.nsc.GenericRunnerSettings
  import scala.tools.nsc.interpreter.IMain

  val settings = new GenericRunnerSettings(println _)
  settings.usejavacp.value = true

  val intp = new IMain(settings, new java.io.PrintWriter(System.out))
  intp.setContextClassLoader
  intp.initializeSynchronous
  intp
}
val intp = getInterpretor
val udf_str =
"""
(str:String)=>{
str.toLowerCase
}
"""
val customTransStr =
"""
|import org.apache.spark.SparkConf
|import org.apache.spark.sql.{DataFrame, SparkSession}
|import org.apache.spark.sql.functions._
|
|new CustomTransformations {
| override def execute(spark: SparkSession, df: DataFrame, func: AnyRef*): DataFrame = {
|
| //reading your UDF
| val str_lower_udf = spark.udf.register("str_lower", func(0).asInstanceOf[Function1[String,String]])
|
| df.createOrReplaceTempView("df")
| val df_with_UDF_cols = spark.sql("select a.*, str_lower(a.fakeEventTag) as customUDFCol1 from df a").withColumn("customUDFCol2", str_lower_udf(col("fakeEventTag")))
|
| df_with_UDF_cols.show()
| df_with_UDF_cols
| }
|}
""".stripMargin
// Compile the UDF string in the interpreter and retrieve the evaluated function object
intp.interpret(udf_str)
val udf_obj = intp.eval(udf_str)

// Evaluate the CustomTransformations implementation from its source string
// (com.twitter.util.Eval comes from Twitter's util-eval library)
val eval = new com.twitter.util.Eval
val customTransform: CustomTransformations = eval[CustomTransformations](customTransStr)

val sampleSparkDF = getSampleSparkDF
val outputDF = customTransform.execute(spark, sampleSparkDF, udf_obj)
outputDF.printSchema()
outputDF.show()
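The helpers getSparkSession and getSampleSparkDF are not shown above; below is a minimal sketch of what they might look like (assumed implementations, not taken from the original answer). The key point is that the SparkSession is only built when the lazy val is first touched, i.e. after the interpreter has been created:

// Assumed helper: builds the SparkSession; invoked lazily, after getInterpretor has run
def getSparkSession: SparkSession = {
  SparkSession.builder()
    .appName("spark-custom-transformations")  // assumed app name
    .master("local[*]")                       // assumed local mode for the example
    .getOrCreate()
}

// Assumed helper: a sample DataFrame with a "fakeEventTag" column, matching the SQL above
def getSampleSparkDF: DataFrame = {
  import spark.implicits._
  Seq("eventA", "eventB", "eventC").toDF("fakeEventTag")
}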