Flink Streaming: from one window, look up state in another window

Asked: 2016-11-11 18:05:05

Tags: apache-flink flink-streaming

I have two streams:

  • Measurement
  • WhoMeasured (metadata about who took the measurement)

These are their case classes:

case class Measurement(var value: Int, var who_measured_id: Int)
case class WhoMeasured(var who_measured_id: Int, var name: String)

The Measurement stream carries a lot of data; the WhoMeasured stream carries very little. In fact, for each who_measured_id only the latest name in the WhoMeasured stream is relevant, so when a new name with the same who_measured_id arrives, the old element can be discarded. This is essentially a hash table that is filled by the WhoMeasured stream.
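The "latest name wins" semantics of that lookup table can be sketched, Flink aside, with a plain mutable map (object and method names here are made up for illustration):

```scala
case class WhoMeasured(who_measured_id: Int, name: String)

object LookupSemantics {
  // One entry per who_measured_id; a newer name simply overwrites the older one.
  val names = scala.collection.mutable.Map.empty[Int, String]

  def onWhoMeasured(w: WhoMeasured): Unit =
    names(w.who_measured_id) = w.name

  def main(args: Array[String]): Unit = {
    onWhoMeasured(WhoMeasured(1, "Alice"))
    onWhoMeasured(WhoMeasured(1, "Bob")) // replaces "Alice" for id 1
    println(names(1))
  }
}
```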

In my custom window function:

class WFunc extends WindowFunction[Measurement, Long, Int, TimeWindow] {
  override def apply(key: Int, window: TimeWindow, input: Iterable[Measurement], out: Collector[Long]): Unit = {

    // Here I need access to the WhoMeasured stream to get the name of the
    // person who took the measurement. The two lookups below are equivalent,
    // since the stream is keyed by who_measured_id:
    val name_who_measured = magic(key)
    // val name_who_measured = magic(input.head.who_measured_id)
  }
}

This is my job. As you can probably see, something is missing: the combination of the two streams.

val who_measured_stream = who_measured_source
  .keyBy(w => w.who_measured_id)
  .countWindow(1)

val measurement_stream = measurements_source
  .keyBy(m => m.who_measured_id)
  .timeWindow(Time.seconds(60), Time.seconds(5))
  .apply(new WFunc)

So, in essence, this is a kind of lookup table that is updated whenever a new element arrives on the WhoMeasured stream.

So the question is: how do I implement such a lookup from one WindowedStream into another?

Follow-up:

After implementing it the way Fabian suggested, the job always fails with some sort of serialization issue:

The implementation of the RichCoFlatMapFunction is not serializable. The object probably contains or references non serializable fields.

However, the only field my JoiningCoFlatMap holds is the suggested ValueState. Its signature looks like this:

class JoiningCoFlatMap extends RichCoFlatMapFunction[Measurement, WhoMeasured, (Measurement, String)] {

Error log:

[info] Loading project definition from /home/jgroeger/Code/MeasurementJob/project
[info] Set current project to MeasurementJob (in build file:/home/jgroeger/Code/MeasurementJob/)
[info] Compiling 8 Scala sources to /home/jgroeger/Code/MeasurementJob/target/scala-2.11/classes...
[info] Running de.company.project.Main dev MeasurementJob
[error] Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: The implementation of the RichCoFlatMapFunction is not serializable. The object probably contains or references non serializable fields.
[error]   at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:100)
[error]   at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.clean(StreamExecutionEnvironment.java:1478)
[error]   at org.apache.flink.streaming.api.datastream.DataStream.clean(DataStream.java:161)
[error]   at org.apache.flink.streaming.api.datastream.ConnectedStreams.flatMap(ConnectedStreams.java:230)
[error]   at org.apache.flink.streaming.api.scala.ConnectedStreams.flatMap(ConnectedStreams.scala:127)
[error]   at de.company.project.jobs.MeasurementJob.run(MeasurementJob.scala:139)
[error]   at de.company.project.Main$.main(Main.scala:55)
[error]   at de.company.project.Main.main(Main.scala)
[error] Caused by: java.io.NotSerializableException: de.company.project.jobs.MeasurementJob
[error]   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
[error]   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
[error]   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
[error]   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
[error]   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
[error]   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
[error]   at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:301)
[error]   at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:81)
[error]   ... 7 more
java.lang.RuntimeException: Nonzero exit code returned from runner: 1
  at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last MeasurementJob/compile:run for the full output.
[error] (MeasurementJob/compile:run) Nonzero exit code returned from runner: 1
[error] Total time: 9 s, completed Nov 15, 2016 2:28:46 PM
Process finished with exit code 1
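The "Caused by: java.io.NotSerializableException: de.company.project.jobs.MeasurementJob" line is a hint: Flink's closure cleaner is trying to serialize the enclosing job class itself, which typically means the function was defined as an inner (or anonymous) class inside the job and therefore carries a hidden reference to it. A minimal, Flink-free sketch of that failure mode, with made-up names:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a job class that is NOT Serializable,
// like de.company.project.jobs.MeasurementJob in the log above.
class Job {
  // An inner class silently captures a reference to Job.this, so Java
  // serialization tries to serialize the enclosing Job as well and fails.
  class InnerFn extends Serializable
  def innerFn: Serializable = new InnerFn
}

object SerializationDemo {
  def serialize(obj: AnyRef): Unit = {
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    oos.writeObject(obj)
  }

  def main(args: Array[String]): Unit = {
    try serialize(new Job().innerFn)
    catch {
      case _: NotSerializableException =>
        println("the inner class drags in the non-serializable outer Job")
    }
  }
}
```

Moving the function to a top-level class (or into a companion object) removes the hidden reference to the enclosing instance, which is usually enough to make the job serializable again.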

1 Answer:

Answer 0 (score: 2):

I think what you are looking for is a window operation followed by a join.

You can implement the join of the high-volume measurement stream with the low-volume, slowly-changing WhoMeasured stream using a stateful CoFlatMapFunction, as in the example below:

val measures: DataStream[Measurement] = ???
val who: DataStream[WhoMeasured] = ???

val agg: DataStream[(Int, Long)] = measures
  .keyBy(_._2) // measured_by_id
  .timeWindow(Time.seconds(60), Time.seconds(5))
  .apply( (id: Int, w: TimeWindow, v: Iterable[(Int, Int, String)], out: Collector[(Int, Long)]) => {
    // do your aggregation
  })

val joined: DataStream[(Int, Long, String)] = agg
  .keyBy(_._1) // measured_by_id
  .connect(who.keyBy(_.who_measured_id))
  .flatMap(new JoiningCoFlatMap)

// CoFlatMapFunction
class JoiningCoFlatMap extends RichCoFlatMapFunction[(Int, Long), WhoMeasured, (Int, Long, String)] {

  var names: ValueState[String] = null

  override def open(conf: Configuration): Unit = {
    val stateDescriptor = new ValueStateDescriptor[String](
      "whoMeasuredName",
      classOf[String],
      ""                 // default value
    )
    names = getRuntimeContext.getState(stateDescriptor)
  }

  override def flatMap1(a: (Int, Long), out: Collector[(Int, Long, String)]): Unit = {
    // join with state
    out.collect( (a._1, a._2, names.value()) )
  }

  override def flatMap2(w: WhoMeasured, out: Collector[(Int, Long, String)]): Unit = {
    // update state
    names.update(w.name)
  }
}

A note on the implementation: a CoFlatMapFunction cannot decide which input to process, i.e., flatMap1 and flatMap2 are called depending on which data arrives at the operator; this cannot be controlled by the function. That is a problem when the state is being initialized: at the beginning, the state might not hold the correct name for an arriving Measurement object and would return the default value instead. You can avoid this by buffering the measurements and joining them once the first update for the key arrives from the who stream. You will need another piece of state for that.
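The buffering variant described above could be sketched as follows. This is an untested sketch, not part of the original answer: the class name BufferingCoFlatMap is made up, and it assumes a Flink version where keyed ListState is available via RuntimeContext.getListState.

```scala
import org.apache.flink.api.common.functions.RichCoFlatMapFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

class BufferingCoFlatMap
    extends RichCoFlatMapFunction[(Int, Long), WhoMeasured, (Int, Long, String)] {

  // Name of the measuring person for the current key; null means "not seen yet".
  @transient private var name: ValueState[String] = _
  // Measurements that arrived before the first WhoMeasured update for the key.
  @transient private var buffer: ListState[(Int, Long)] = _

  override def open(conf: Configuration): Unit = {
    name = getRuntimeContext.getState(
      new ValueStateDescriptor[String]("whoMeasuredName", classOf[String], null))
    buffer = getRuntimeContext.getListState(
      new ListStateDescriptor[(Int, Long)]("pendingMeasurements", classOf[(Int, Long)]))
  }

  override def flatMap1(a: (Int, Long), out: Collector[(Int, Long, String)]): Unit = {
    val n = name.value()
    if (n == null) buffer.add(a)        // name unknown yet: buffer the measurement
    else out.collect((a._1, a._2, n))   // name known: join immediately
  }

  override def flatMap2(w: WhoMeasured, out: Collector[(Int, Long, String)]): Unit = {
    if (name.value() == null) {         // first update for this key: flush the buffer
      val it = buffer.get().iterator()
      while (it.hasNext) {
        val a = it.next()
        out.collect((a._1, a._2, w.name))
      }
      buffer.clear()
    }
    name.update(w.name)
  }
}
```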