Apache Flink 0.10: how do I get the first occurrence of a composite key from an unbounded input DataStream?

Time: 2016-02-24 10:09:04

Tags: apache-flink flink-streaming

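I am new to Apache Flink. My input is an unbounded data stream (fed into Flink 0.10 via Kafka).

I want to keep only the first occurrence of each primary key (the primary key is contract_num plus event_dt).
These "duplicates" occur almost immediately after one another. The source system cannot filter them out for me, so Flink has to do it.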

Here is my input data:

contract_num, event_dt, attr 
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:08, Y
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C

Here is the output data I want:

A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C

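Note that the second row has been removed, because the key combination A1 and 2016-02-24 10:25:08 already occurred in the first row.

How can I do this with Flink 0.10?

I was thinking of using keyBy(0,1), but I don't know what to do after that!

(I use joda-time and org.flinkspector to set up these tests.)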

@Test
public void test() {
    DateTime threeSecondsAgo = (new DateTime()).minusSeconds(3);
    DateTime twoSecondsAgo = (new DateTime()).minusSeconds(2);
    DateTime oneSecondsAgo = (new DateTime()).minusSeconds(1);

    DataStream<Tuple3<String, Date, String>> testStream =
            createTimedTestStreamWith(
                    Tuple3.of("A1", threeSecondsAgo.toDate(), "X"))
            .emit(Tuple3.of("A1", threeSecondsAgo.toDate(), "Y"), after(0, TimeUnit.NANOSECONDS))
            .emit(Tuple3.of("A1", twoSecondsAgo.toDate(), "Z"), after(0, TimeUnit.NANOSECONDS))
            .emit(Tuple3.of("A2", oneSecondsAgo.toDate(), "C"), after(0, TimeUnit.NANOSECONDS))
            .close();

    testStream.keyBy(0,1);
}

2 answers:

Answer 0: (score: 5)

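Filtering duplicates on an infinite stream will eventually fail if the key space is larger than the available storage. The reason is that you have to store the keys you have already seen somewhere in order to filter out duplicates. It is therefore better to define a time window after which the set of currently seen keys can be purged.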
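
To make the purge idea concrete, here is a minimal plain-Java sketch with no Flink involved. The class name ExpiringSeenSet, the method firstOccurrence and the ttlMs parameter are hypothetical names chosen for illustration. The sketch assumes timestamps are passed in non-decreasing order, so the oldest entries always sit at the front of the insertion-ordered map.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: remembers each key for at most ttlMs milliseconds. */
class ExpiringSeenSet<K> {
    // Insertion-ordered map: key -> time of first sighting in milliseconds.
    private final LinkedHashMap<K, Long> firstSeen = new LinkedHashMap<>();
    private final long ttlMs;

    ExpiringSeenSet(long ttlMs) {
        this.ttlMs = ttlMs;
    }

    /** Returns true if the key is new, or if its earlier entry has already expired. */
    boolean firstOccurrence(K key, long nowMs) {
        purge(nowMs);
        if (firstSeen.containsKey(key)) {
            return false; // duplicate within the retention period
        }
        firstSeen.put(key, nowMs);
        return true;
    }

    /** Drops entries that are at least ttlMs old. Assumes nowMs never decreases. */
    private void purge(long nowMs) {
        Iterator<Map.Entry<K, Long>> it = firstSeen.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMs - it.next().getValue() >= ttlMs) {
                it.remove();
            } else {
                break; // all remaining entries are newer
            }
        }
    }

    int size() {
        return firstSeen.size();
    }

    public static void main(String[] args) {
        ExpiringSeenSet<String> seen = new ExpiringSeenSet<>(1000);
        System.out.println(seen.firstOccurrence("A1|10:25:08", 0));    // true
        System.out.println(seen.firstOccurrence("A1|10:25:08", 500));  // false
        System.out.println(seen.firstOccurrence("A1|10:25:08", 1000)); // true, entry expired
    }
}
```

Memory is then bounded by the number of distinct keys that arrive within ttlMs rather than by the whole key space. The price is that a duplicate arriving after the retention period is emitted again.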

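If you are aware of this problem but want to try it anyway, you can do so by applying a stateful flatMap operation after the keyBy call. The stateful mapper uses Flink's state abstraction to store whether it has already seen an element with this key. That way you also benefit from Flink's fault tolerance mechanism, because your state is checkpointed automatically.

A Flink program that does the job could look like this: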

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<Tuple3<String, Date, String>> input = env.fromElements(Tuple3.of("foo", new Date(1000), "bar"), Tuple3.of("foo", new Date(1000), "foobar"));

    input.keyBy(0, 1).flatMap(new DuplicateFilter()).print();

    env.execute("Test");
}

The implementation of DuplicateFilter depends on the Flink version.

Implementation for versions >= 1.0

public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {

    static final ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Boolean.class, false);
    private ValueState<Boolean> operatorState;

    @Override
    public void open(Configuration configuration) {
        operatorState = this.getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
        if (!operatorState.value()) {
            // we haven't seen the element yet
            out.collect(value);
            // set operator state to true so that we don't emit elements with this key again
            operatorState.update(true);
        }
    }
}

Implementation for version 0.10

public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {

    private OperatorState<Boolean> operatorState;

    @Override
    public void open(Configuration configuration) {
        operatorState = this.getRuntimeContext().getKeyValueState("seen", Boolean.class, false);
    }

    @Override
    public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
        if (!operatorState.value()) {
            // we haven't seen the element yet
            out.collect(value);
            operatorState.update(true);
        }
    }
}
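
To see what the per-key "seen" flag does on the sample data from the question, here is a plain-Java sketch that stands in for Flink's keyed state with an ordinary HashMap. The class KeyedSeenFlagSketch and the method firstPerKey are hypothetical names. It is only an illustration of the logic, not of how Flink stores or partitions state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KeyedSeenFlagSketch {

    /** Keeps each row whose (contract_num, event_dt) pair has not been seen before. */
    static List<String[]> firstPerKey(List<String[]> rows) {
        // Stand-in for per-key state: composite key -> "seen" flag.
        Map<List<String>, Boolean> seen = new HashMap<>();
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            // Composite key from fields 0 and 1, analogous to keyBy(0, 1).
            List<String> key = Arrays.asList(row[0], row[1]);
            // The flag defaults to false, like the default value of the state.
            if (!seen.getOrDefault(key, false)) {
                out.add(row);
                seen.put(key, true);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> input = Arrays.asList(
                new String[] {"A1", "2016-02-24 10:25:08", "X"},
                new String[] {"A1", "2016-02-24 10:25:08", "Y"},
                new String[] {"A1", "2016-02-24 10:25:09", "Z"},
                new String[] {"A2", "2016-02-24 10:25:10", "C"});
        for (String[] row : firstPerKey(input)) {
            System.out.println(String.join(", ", row));
        }
    }
}
```

Run on the question's four input rows, this prints the three desired output rows and drops the row with attribute Y. In a real Flink job with parallelism greater than one, the relative order of output rows that belong to different keys is not guaranteed.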

Update: using a tumbling time window

input.keyBy(0, 1).timeWindow(Time.seconds(1)).apply(new WindowFunction<Iterable<Tuple3<String,Date,String>>, Tuple3<String, Date, String>, Tuple, TimeWindow>() {
    @Override
    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple3<String, Date, String>> input, Collector<Tuple3<String, Date, String>> out) throws Exception {
        out.collect(input.iterator().next());
    }
})
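
The windowed variant keeps state bounded, but it deduplicates only within one window, and the result is emitted when the window fires rather than when the first element arrives. The following plain-Java sketch shows how a timestamp maps to a tumbling window. The class TumblingWindowSketch and the method windowStart are hypothetical names, and the sketch assumes non-negative millisecond timestamps and windows aligned to multiples of the window size with no offset.

```java
class TumblingWindowSketch {

    /** Start of the tumbling window that contains timestampMs (assumes timestampMs >= 0). */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    public static void main(String[] args) {
        long sizeMs = 1000; // one-second windows, as in timeWindow(Time.seconds(1))

        // Two duplicates 700 ms apart that fall into the same window:
        System.out.println(windowStart(1200, sizeMs)); // 1000
        System.out.println(windowStart(1900, sizeMs)); // 1000

        // Two duplicates only 200 ms apart that straddle a window boundary:
        System.out.println(windowStart(1900, sizeMs)); // 1000
        System.out.println(windowStart(2100, sizeMs)); // 2000
    }
}
```

In the second case each window sees the key once, so both elements would be emitted even though they are closer together than the pair in the first case. Whether this matters depends on how tightly the duplicates cluster relative to the window size.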

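Answer 1: (score: 2)

Here is another way that I have just finished writing. Its downside is that it needs more custom code, since it does not use the built-in Flink window functions, but it does not have the latency penalty that Till mentioned. The full example is on GitHub.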

package com.dataartisans.filters;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.checkpoint.CheckpointedAsynchronously;

import java.io.Serializable;
import java.util.HashSet;
import java.util.concurrent.TimeUnit;


/**
 * This class filters duplicates that occur within a configurable time of each other in a data stream.
 */
public class DedupeFilterFunction<T, K extends Serializable> extends RichFilterFunction<T> implements CheckpointedAsynchronously<HashSet<K>> {

  private LoadingCache<K, Boolean> dedupeCache;
  private final KeySelector<T, K> keySelector;
  private final long cacheExpirationTimeMs;

  /**
   * @param cacheExpirationTimeMs The expiration time for elements in the cache
   */
  public DedupeFilterFunction(KeySelector<T, K> keySelector, long cacheExpirationTimeMs){
    this.keySelector = keySelector;
    this.cacheExpirationTimeMs = cacheExpirationTimeMs;
  }

  @Override
  public void open(Configuration parameters) throws Exception {
    createDedupeCache();
  }


  @Override
  public boolean filter(T value) throws Exception {
    K key = keySelector.getKey(value);
    boolean seen = dedupeCache.get(key);
    if (!seen) {
      dedupeCache.put(key, true);
      return true;
    } else {
      return false;
    }
  }

  @Override
  public HashSet<K> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
    return new HashSet<>(dedupeCache.asMap().keySet());
  }

  @Override
  public void restoreState(HashSet<K> state) throws Exception {
    createDedupeCache();
    for (K key : state) {
      dedupeCache.put(key, true);
    }
  }

  private void createDedupeCache() {
    dedupeCache = CacheBuilder.newBuilder()
      .expireAfterWrite(cacheExpirationTimeMs, TimeUnit.MILLISECONDS)
      .build(new CacheLoader<K, Boolean>() {
        @Override
        public Boolean load(K k) throws Exception {
          return false;
        }
      });
  }
}