Spark Java API: "Task not serializable" when not using a lambda

Time: 2017-09-17 17:05:32

Tags: apache-spark java-8

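I'm seeing behavior in Spark (2.2.0) that I don't understand. It showed up when I tried to extract a lambda function into a variable, so I suspect it has to do with lambdas versus anonymous classes: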

This works:

public class EventsFilter
{
    public Dataset< String > filter( Dataset< String > events )
    {
        return events.filter( ( FilterFunction< String > ) x -> x.length() > 3 );
    }
}

But this does not:

public class EventsFilter
{
    public Dataset< String > filter( Dataset< String > events )
    {
        FilterFunction< String > filter = new FilterFunction< String >(){
            @Override public boolean call( String value ) throws Exception
            {
                return value.length() > 3;
            }
        };
        return events.filter( filter );
    }
}

org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    ...
Caused by: java.io.NotSerializableException: ...EventsFilter
Serialization stack:
    - object not serializable (class: ...EventsFilter, value: ...EventsFilter@e521067)
    - field (class: ...EventsFilter$1, name: this$0, type: class ...EventsFilter)
    - object (class ...EventsFilter$1, ...EventsFilter$1@5c70d7f0)
    - element of array (index: 1)
    - array (class [Ljava.lang.Object;, size 4)
    - field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, name: references$1, type: class [Ljava.lang.Object;)
    - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, <function2>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)

I'm testing it with:

@Test
public void test()
{
    EventsFilter filter = new EventsFilter();
    Dataset<String> input = SparkSession.builder().appName( "test" ).master( "local" ).getOrCreate()
            .createDataset( Arrays.asList( "123" , "123"  , "3211" ) ,
                Encoders.kryo( String.class ) );

    Dataset<String> res = filter.filter( input );
    assertThat( res.count() , is( 1L ) );
}

Even stranger, when the same code is placed in a static method, both versions appear to work.

How does explicitly defining the function inside the method cause that sneaky 'this' reference to be serialized?

2 answers:

Answer 0 (score: 3)

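Java inner classes, anonymous classes included, hold a reference to the enclosing instance of the outer class; that is the this$0 field in your serialization stack. Your outer class is not serializable, so the exception is thrown.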

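Lambdas do not keep a reference to the enclosing instance unless they actually use it, so a non-serializable outer class is not a problem. More here.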
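The difference can be reproduced without Spark, using plain Java serialization. The following is a minimal sketch, not the code from the question: `CaptureDemo`, `SerializableFilter`, and `serializes` are hypothetical names introduced for illustration, and `SerializableFilter` only stands in for Spark's `FilterFunction` (which is likewise a `Serializable` interface).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for EventsFilter: deliberately NOT Serializable.
public class CaptureDemo {

    // Hypothetical stand-in for Spark's FilterFunction.
    public interface SerializableFilter<T> extends Serializable {
        boolean call(T value) throws Exception;
    }

    // Anonymous class created in an instance method: it carries a hidden
    // reference to the enclosing CaptureDemo instance.
    public SerializableFilter<String> anonymousFilter() {
        return new SerializableFilter<String>() {
            @Override
            public boolean call(String value) {
                return value.length() > 3;
            }
        };
    }

    // Lambda that never touches 'this': nothing from the enclosing
    // instance is captured.
    public SerializableFilter<String> lambdaFilter() {
        return value -> value.length() > 3;
    }

    // Returns true if Java serialization accepts the object graph.
    public static boolean serializes(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CaptureDemo demo = new CaptureDemo();
        System.out.println("anonymous class serializes: "
                + serializes(demo.anonymousFilter()));
        System.out.println("lambda serializes:          "
                + serializes(demo.lambdaFilter()));
    }
}
```

On a Java 8 toolchain like the one in the question, the anonymous class should fail to serialize because of its reference to the enclosing instance, while the lambda should succeed.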

Answer 1 (score: 1)

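I had mistakenly assumed that lambdas are implemented as inner classes. That is no longer the case (a very helpful talk). Also, as T.Gawęda answered, an inner class really does hold a reference to the outer class even when it doesn't need one (here). This difference explains the behavior.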
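The hidden reference can also be observed directly through reflection, which covers the static case from the question as well. This is a sketch under assumptions: `OuterReferenceDemo`, `SerializableFilter`, `LongerThanThree`, and `holdsOuterReference` are hypothetical names, and the check simply looks for a compiler-generated field whose type is the outer class.

```java
import java.io.Serializable;
import java.lang.reflect.Field;

// Hypothetical outer class, not Serializable, like EventsFilter.
public class OuterReferenceDemo {

    // Hypothetical stand-in for Spark's FilterFunction.
    public interface SerializableFilter<T> extends Serializable {
        boolean call(T value) throws Exception;
    }

    // Anonymous class in an instance method: the compiler adds a
    // synthetic field pointing at the enclosing instance.
    public SerializableFilter<String> fromInstanceMethod() {
        return new SerializableFilter<String>() {
            @Override
            public boolean call(String value) {
                return value.length() > 3;
            }
        };
    }

    // The same anonymous class in a static method: there is no
    // enclosing instance, so there is nothing to reference.
    public static SerializableFilter<String> fromStaticMethod() {
        return new SerializableFilter<String>() {
            @Override
            public boolean call(String value) {
                return value.length() > 3;
            }
        };
    }

    // A static nested class never holds an outer reference either.
    public static class LongerThanThree implements SerializableFilter<String> {
        @Override
        public boolean call(String value) {
            return value.length() > 3;
        }
    }

    // True if the object's class declares a synthetic field of the
    // outer class type (this$0 in the question's stack trace).
    public static boolean holdsOuterReference(Object o) {
        for (Field f : o.getClass().getDeclaredFields()) {
            if (f.isSynthetic() && f.getType() == OuterReferenceDemo.class) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("instance method: "
                + holdsOuterReference(new OuterReferenceDemo().fromInstanceMethod()));
        System.out.println("static method:   "
                + holdsOuterReference(fromStaticMethod()));
        System.out.println("static nested:   "
                + holdsOuterReference(new LongerThanThree()));
    }
}
```

If a lambda is not an option, moving the anonymous class into a static method or replacing it with a static nested class are two ways to avoid dragging the outer instance into the serialized closure; making the outer class `Serializable` would also work, at the cost of shipping it with the task.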