Enriching one stream with another stream

Time: 2016-07-28 19:31:56

Tags: hadoop apache-flink bigdata

I would like to do the following with Apache Flink. I have a main stream that has to be enriched with data from another stream. The elements of the main stream have the attributes "site" and "timestamp". The other stream (let's call it countrystream) has the attributes "site" and "country". The countrystream should keep track of the latest country for each site. For example, if ("klm.com", "netherlands") arrives first and some time later the tuple ("klm.com", "france") arrives, then "klm.com" should point to "france" (because it is the more recent one). So it has to maintain state. Now suppose a tuple ("klm.com", 100) arrives on the main stream. It should be enriched to ("klm.com", 100, "france"). If a site cannot be found in the countrystream, it should be enriched with "?", for example ("stackoverflow.com", 150, "?"). How can I achieve this?
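
To make the expected behaviour concrete, here is a rough sketch (the case-class names and shapes below are placeholders I made up purely for illustration, not my actual code):

case class MainEvent(site: String, timestamp: Long)        // element of the main stream
case class CountryUpdate(site: String, country: String)    // element of countrystream
case class Enriched(site: String, timestamp: Long, country: String)

// ("klm.com", "netherlands") arrives, later ("klm.com", "france")  => the latest country wins
// MainEvent("klm.com", 100)           => Enriched("klm.com", 100, "france")
// MainEvent("stackoverflow.com", 150) => Enriched("stackoverflow.com", 150, "?")   // site unknown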

1 answer:

Answer 0 (score: 0)

I found a solution (it took me some time). Is this efficient? Can it be improved? And does this mean my iterative stream cannot have checkpoints?

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

val env = StreamExecutionEnvironment.getExecutionEnvironment

// The main stream to be enriched (keys only, for this proof of concept).
val mainStream = env.fromElements("a", "a", "b", "a", "a", "b", "b", "a", "c", "b", "a", "c")

// The info stream: (sequence number, key, info). Feeding every element straight back
// into the iteration keeps this finite stream circulating, so its records remain
// available for later windows instead of the stream terminating.
val infoStream = env.fromElements((1, "a", "It is F"), (2, "b", "It is B"), (3, "c", "It is C"), (4, "a", "Whoops, it is A"))
        .iterate(
            iteration => {
                (iteration, iteration)
            }
        )

mainStream
    .coGroup(infoStream)
    .where[String]((x: String) => x)
    .equalTo(_._2)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(1)))
    .apply {
        (first: Iterator[String], second: Iterator[(Int, String, String)], out: Collector[(String, String)]) => {
            // An Iterator can only be traversed once, so materialize the info records
            // before looking them up for every main-stream element.
            val infoRecords = second.toList
            first.foreach { (key: String) =>
                val matchingRecords = infoRecords.filter(_._2 == key)
                if (matchingRecords.nonEmpty) {
                    // Take the most recent info record (highest sequence number).
                    val matchingRecord = matchingRecords.maxBy(_._1)
                    out.collect((matchingRecord._2, matchingRecord._3))
                } else {
                    // No info for this key yet: fall back to "?" as required.
                    out.collect((key, "?"))
                }
            }
        }
    }
    .print()

env.execute("proof_of_concept")
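
Since I am asking whether this can be improved: a possibly simpler alternative I am considering (only an untested sketch; the element types and the ValueStateDescriptor constructor are my own assumptions and may differ between Flink versions) is to drop the iteration and the windows entirely and keep the latest country per site in keyed state, using connect and a RichCoFlatMapFunction. Because there is no iterative stream, checkpointing should then work normally:

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

val env = StreamExecutionEnvironment.getExecutionEnvironment

// (site, timestamp) and (site, country) -- placeholder data for the sketch.
val mainStream    = env.fromElements(("klm.com", 100L), ("stackoverflow.com", 150L))
val countryStream = env.fromElements(("klm.com", "netherlands"), ("klm.com", "france"))

val enriched = mainStream
    .connect(countryStream)
    .keyBy(main => main._1, country => country._1)   // key both streams by site
    .flatMap(new RichCoFlatMapFunction[(String, Long), (String, String), (String, Long, String)] {

        // Latest country seen for the current site; "?" until one arrives.
        private var latestCountry: ValueState[String] = _

        override def open(parameters: Configuration): Unit = {
            latestCountry = getRuntimeContext.getState(
                new ValueStateDescriptor[String]("latest-country", classOf[String], "?"))
        }

        // Main-stream element: emit it together with the currently stored country.
        override def flatMap1(main: (String, Long), out: Collector[(String, Long, String)]): Unit = {
            out.collect((main._1, main._2, latestCountry.value()))
        }

        // Country update: remember the most recent country for this site.
        override def flatMap2(update: (String, String), out: Collector[(String, Long, String)]): Unit = {
            latestCountry.update(update._2)
        }
    })

enriched.print()

env.execute("enrich_with_keyed_state")

Here flatMap2 only updates the per-site state, while flatMap1 emits each main-stream element together with whatever country is currently stored, falling back to the "?" default when no country has been seen yet.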