NPE when Flink groups case class objects

Asked: 2016-11-04 15:22:19

Tags: dataset batch-processing apache-flink

I am using the DataSet API and have the following case classes:

case class Geo(country:Int, province:Int, city:Int, county:Int)


case class AntiFraudLog(
    eventType: Int,
    valid: Boolean    
  )

case class AntiFraudSession(fraudLogs: Seq[AntiFraudLog])

Then I build a DataSet of key/value pairs whose value is a case class:

 val dataKeyValue: DataSet[(Long, AntiFraudLog)]

and try to group the items that share the same key:

val groupedSortedData = dataKeyValue groupBy 0

Then I convert each group into another case class:

 val sessionData:DataSet[AntiFraudSession] = groupedSortedData reduceGroup(
  logs => AntiFraudSession(logs.map(_._2).toSeq)
  )
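
For readers unfamiliar with the two steps above, the intended result can be sketched with plain Scala collections. This is not Flink code: `GroupingSketch` and `toSessions` are hypothetical names used only for illustration, and the sketch merely mirrors what `groupBy 0` followed by `reduceGroup` is meant to compute.

```scala
object GroupingSketch {
  case class AntiFraudLog(eventType: Int, valid: Boolean)
  case class AntiFraudSession(fraudLogs: Seq[AntiFraudLog])

  // Hypothetical helper: builds one session per key from (key, log) pairs.
  def toSessions(data: Seq[(Long, AntiFraudLog)]): Map[Long, AntiFraudSession] =
    data
      .groupBy(_._1) // pairs with the same key end up in the same group
      .map { case (key, pairs) =>
        // keep only the values and materialize them into a strict collection
        key -> AntiFraudSession(pairs.map(_._2).toVector)
      }
}
```

Each resulting session holds a non-null collection, so a null `fraudLogs` would have to come from somewhere else in the pipeline.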

But when I run the program, I get the following exception:

Caused by: java.lang.NullPointerException
    at org.apache.flink.api.scala.typeutils.TraversableSerializer.serialize(TraversableSerializer.scala:90)
    at org.apache.flink.api.scala.typeutils.TraversableSerializer.serialize(TraversableSerializer.scala:32)
    at org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:100)
    at org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:30)
    at org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:100)
    at org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:30)
    at org.apache.flink.runtime.plugable.SerializationDelegate.write(SerializationDelegate.java:56)
    at org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer.addRecord(SpanningRecordSerializer.java:83)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:85)
    at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
    at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
    at org.apache.flink.api.scala.DataSet$$anon$5$$anonfun$flatMap$1.apply(DataSet.scala:417)
    at org.apache.flink.api.scala.DataSet$$anon$5$$anonfun$flatMap$1.apply(DataSet.scala:417)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.flink.api.scala.DataSet$$anon$5.flatMap(DataSet.scala:417)
    at org.apache.flink.runtime.operators.chaining.ChainedFlatMapDriver.collect(ChainedFlatMapDriver.java:80)
    at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:163)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
    at java.lang.Thread.run(Thread.java:745)

Does anyone know how to fix it?

2 Answers:

Answer 0: (score: 0)

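It looks as if Flink cannot serialize a case class whose collection field is null. In your scenario, that would be an AntiFraudSession with fraudLogs = null. Is there additional transformation logic in your job that could produce such an element in a session?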
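
One way to rule this out is to normalize a possibly-null collection before it reaches the case class. The following is a minimal sketch in plain Scala, assuming an empty Seq is an acceptable stand-in for "no logs"; `NullSafeSession` and `safeSession` are hypothetical names, not part of Flink or of the code in the question.

```scala
object NullSafeSession {
  case class AntiFraudLog(eventType: Int, valid: Boolean)
  case class AntiFraudSession(fraudLogs: Seq[AntiFraudLog])

  // Hypothetical helper: Option(x) is None when x is null,
  // so a null Seq is replaced by an empty one.
  def safeSession(logs: Seq[AntiFraudLog]): AntiFraudSession =
    AntiFraudSession(Option(logs).getOrElse(Seq.empty))
}
```

Calling `safeSession` wherever sessions are constructed keeps the field non-null without changing the case class definitions.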

Answer 1: (score: 0)

When Flink serializes or deserializes Scala case classes, make sure that no case class field holds a null object or value.

To avoid the java.lang.NullPointerException, use Option for any case class field or object that may be null.

Based on your example:

If a field may be empty:

case class AntiFraudLog(
    eventType: Option[Int], 
    valid: Boolean    
  )

If the object held by a case class field (here the Seq) may be null:

case class AntiFraudSession(fraudLogs: Option[Seq[AntiFraudLog]]) 

Note: using null in Scala is not considered good practice. Prefer the alternatives Scala provides, such as Option, to avoid it.
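
As a rough illustration of the Option approach, the sketch below wraps a possibly-null Seq when the session is built and reads it back without ever touching null. It is plain Scala with no Flink dependency; `OptionalSession`, `fromNullable`, and `validCount` are hypothetical names introduced here.

```scala
object OptionalSession {
  case class AntiFraudLog(eventType: Option[Int], valid: Boolean)
  case class AntiFraudSession(fraudLogs: Option[Seq[AntiFraudLog]])

  // Hypothetical helper: Option(null) yields None, any other value yields Some(value).
  def fromNullable(logs: Seq[AntiFraudLog]): AntiFraudSession =
    AntiFraudSession(Option(logs))

  // Counts valid logs; a missing collection is treated as zero logs.
  def validCount(session: AntiFraudSession): Int =
    session.fraudLogs.map(_.count(_.valid)).getOrElse(0)
}
```

Compared with substituting an empty Seq, this keeps "no collection" distinguishable from "empty collection", at the cost of unwrapping the Option wherever the field is read.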
