Apache Flink - serialize JSON and perform a join

Date: 2016-02-17 07:42:27

Tags: join stream apache-kafka apache-flink

I am trying to read strings from a Kafka topic, parse them with the Jackson library, and join them with another stream.

Below is sample code with the two data streams. I want to perform a join on these message streams.

For example, the incoming streams are:

messageStream1 = {"A":"a"}
messageStream2 = {"B":"a"}

The join condition is messageStream1."A" = messageStream2."B". How can I implement this in Flink?
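Conceptually, the join pairs records from the first stream whose "A" value equals some record's "B" value in the second stream. A minimal plain-Java sketch of that pairing logic (no Flink or Jackson; the records are hard-coded from the sample above purely for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinSketch {
    public static void main(String[] args) {
        // Records from each stream, reduced to their join-key fields.
        List<Map<String, String>> stream1 = List.of(Map.of("A", "a"), Map.of("A", "b"));
        List<Map<String, String>> stream2 = List.of(Map.of("B", "a"), Map.of("B", "c"));

        // Index stream2 by its join key "B" ...
        Map<String, List<Map<String, String>>> index = new HashMap<>();
        for (Map<String, String> r : stream2) {
            index.computeIfAbsent(r.get("B"), k -> new ArrayList<>()).add(r);
        }

        // ... then probe with stream1's key "A" (a hash join).
        List<String> joined = new ArrayList<>();
        for (Map<String, String> r : stream1) {
            for (Map<String, String> match : index.getOrDefault(r.get("A"), List.of())) {
                joined.add(r + " <-> " + match);
            }
        }
        System.out.println(joined); // only the records with A = B = "a" pair up
    }
}
```

This is what the Flink join below does for you, scoped to a window.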

DataStream 1:

DataStream<String> messageStream1 = env.addSource(
  new FlinkKafkaConsumer082<String>("input", new SimpleStringSchema() , parameterTool.getProperties()));

DataStream<JsonNode> jsonStream1 = messageStream1.map(new MapFunction<String, JsonNode>() {
    // Reuse one ObjectMapper per operator instance instead of creating one per record.
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public JsonNode map(String value) throws Exception {
        try {
            JsonNode rootNode = mapper.readTree(value);
            Iterator<Map.Entry<String, JsonNode>> fieldsIterator = rootNode.fields();
            while (fieldsIterator.hasNext()) {
                Map.Entry<String, JsonNode> field = fieldsIterator.next();
                System.out.println("Key: " + field.getKey() + "\tValue: " + field.getValue());
            }
            return rootNode;
        } catch (java.io.IOException ex) {
            ex.printStackTrace();
            return null; // malformed records become nulls; consider filtering them downstream
        }
    }
});

DataStream 2:

DataStream<String> messageStream2 = env.addSource(
  new FlinkKafkaConsumer082<String>("input", new SimpleStringSchema() , parameterTool.getProperties()));

DataStream<JsonNode> jsonStream2 = messageStream2.map(new MapFunction<String, JsonNode>() {
    // Reuse one ObjectMapper per operator instance instead of creating one per record.
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public JsonNode map(String value) throws Exception {
        try {
            JsonNode rootNode = mapper.readTree(value);
            Iterator<Map.Entry<String, JsonNode>> fieldsIterator = rootNode.fields();
            while (fieldsIterator.hasNext()) {
                Map.Entry<String, JsonNode> field = fieldsIterator.next();
                System.out.println("Key: " + field.getKey() + "\tValue: " + field.getValue());
            }
            return rootNode;
        } catch (java.io.IOException ex) {
            ex.printStackTrace();
            return null; // malformed records become nulls; consider filtering them downstream
        }
    }
});

1 Answer:

Answer 0 (score: 2)

You need to extract the key field into an extra attribute so that Flink can access it (an alternative is to provide a custom key selector: https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#specifying-keys).

Thus, the return type of map(...) would likely be Tuple2&lt;String, JsonNode&gt; (if String is the right type for your join attribute).
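The idea is that each record becomes a (joinKey, payload) pair before the join. A dependency-free sketch of that extraction (a regex stands in for Jackson here to keep it self-contained, and the field name "A" follows the question's sample; both are assumptions — in the real map function you would use mapper.readTree(value) and return a Flink Tuple2):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyExtraction {
    // Pull the string value of a given field out of a flat, one-level JSON object.
    static Map.Entry<String, String> toKeyedRecord(String json, String fieldName) {
        Matcher m = Pattern
            .compile("\"" + Pattern.quote(fieldName) + "\"\\s*:\\s*\"([^\"]*)\"")
            .matcher(json);
        String key = m.find() ? m.group(1) : null;
        // Analogue of Tuple2<String, JsonNode>: (join key, original record)
        return new SimpleEntry<>(key, json);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> rec = toKeyedRecord("{\"A\":\"a\"}", "A");
        System.out.println(rec.getKey()); // prints a
    }
}
```

With both streams mapped this way, the key sits at tuple position 0 on each side, which is what `.where(0).equalTo(0)` below refers to.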

You can then specify your join as described in the documentation (https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html):

messageStream1.join(messageStream2)
    // Both zeros refer to the 0th attribute, i.e., the String key field of Tuple2
    .where(0).equalTo(0)
    .window(TumblingTimeWindows.of(Time.of(3, TimeUnit.SECONDS)))
    .apply(new JoinFunction() {...});

To perform a join with the DataStream API, you also need to specify a join window. Only tuples belonging to the same window can be joined.
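The window requirement can be pictured without Flink: each record carries a timestamp, every record falls into exactly one tumbling window (3 seconds here, matching the snippet above), and only records in the same bucket are join candidates. A plain-Java sketch of that bucketing (the timestamps are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WindowBucketing {
    public static void main(String[] args) {
        long windowMillis = 3_000L; // tumbling window size, as in Time.of(3, SECONDS)

        // {timestampMillis, recordId} pairs from one stream; values are made up.
        long[][] records = {{500, 1}, {2_900, 2}, {3_100, 3}};

        // Assign each record to its tumbling window by truncating the timestamp
        // down to the nearest multiple of the window size.
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long[] r : records) {
            long windowStart = (r[0] / windowMillis) * windowMillis;
            windows.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(r[1]);
        }

        System.out.println(windows); // {0=[1, 2], 3000=[3]}
    }
}
```

Records 1 and 2 share the [0 s, 3 s) window and could be joined with matching keys from the other stream; record 3 lands in the next window and never meets them.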