Spark Task not serializable with lag Window function

Time: 2016-05-18 13:34:35

Tags: scala apache-spark serialization apache-spark-sql window-functions

I noticed that after I use a Window function over a DataFrame, Spark returns a "Task not serializable" exception if I then call map(). This is my code:

val hc: org.apache.spark.sql.hive.HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hc.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def f(): String = "test"
case class P(name: String, surname: String)

val lag_result: org.apache.spark.sql.Column = lag($"name", 1).over(Window.partitionBy($"surname"))
val lista: List[P] = List(P("N1", "S1"), P("N2", "S2"), P("N2", "S2"))
val data_frame: org.apache.spark.sql.DataFrame = hc.createDataFrame(sc.parallelize(lista))

// Throws "Task not serializable"
data_frame.withColumn("lag_result", lag_result).map(x => f)
// data_frame.withColumn("lag_result", lag_result).map{ case x => def f(): String = "test"; f }.collect // This works

This is the stack trace:

org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
    ... more
Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
Serialization stack:
    - object not serializable (class: org.apache.spark.sql.Column, value: 'lag(name,1,null) windowspecdefinition(surname,UnspecifiedFrame))
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: lag_result, type: class org.apache.spark.sql.Column)
    ... and more

1 Answer:

Answer 0 (score: 10):

lag returns o.a.s.sql.Column, which is not serializable, and the same applies to WindowSpec. In interactive mode these objects can be pulled in as part of the closure for map, which is exactly what fails here:

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val df = Seq(("foo", 1), ("bar", 2)).toDF("x", "y")
df: org.apache.spark.sql.DataFrame = [x: string, y: int]

scala> val w = Window.partitionBy("x").orderBy("y")
w: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@307a0097

scala> val lag_y = lag(col("y"), 1).over(w)
lag_y: org.apache.spark.sql.Column = 'lag(y,1,null) windowspecdefinition(x,y ASC,UnspecifiedFrame)

scala> def f(x: Any) = x.toString
f: (x: Any)String

scala> df.select(lag_y).map(f _).first
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
...
Caused by: java.io.NotSerializableException: org.apache.spark.sql.expressions.WindowSpec
Serialization stack:
    - object not serializable (class: org.apache.spark.sql.expressions.WindowSpec, value: org.apache.spark.sql.expressions.WindowSpec@307a0097)
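A simple solution is to mark both values as transient, so the REPL wrapper object holding them is not dragged into the serialized closure. Below is a minimal sketch continuing the same session (df, f, Window, lag, and col as defined above); the REPL echoes and result values are omitted rather than reproduced exactly:

// Marking the WindowSpec and the Column as @transient keeps them out of
// the closure that map serializes for the executors.
@transient val w = Window.partitionBy("x").orderBy("y")
@transient val lag_y = lag(col("y"), 1).over(w)

// Per the fix above, this should now run without the
// "Task not serializable" exception.
df.select(lag_y).map(f _).first

The window expression is still fully usable on the driver side inside select; @transient only tells Java serialization to skip these fields when the surrounding REPL wrapper instance is shipped with the closure.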