我们有两种消息来到Flink
我们为这两个消息提供了单独的源流。我们已经为这两个流附加了相同的接收器。 我们想要做的是广播控制消息,以便所有并行运行的接收器都应该接收它。
以下是相同的代码:
package com.ranjit.com.flinkdemo;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;
import org.apache.flink.streaming.connectors.fs.StringWriter;;
public class FlinkBroadcast {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);
DataStream<String> ctrl_message_stream = env.socketTextStream("localhost", 8088);
ctrl_message_stream.broadcast();
DataStream<String> message_stream = env.socketTextStream("localhost", 8087);
RollingSink sink = new RollingSink<String>("/base/path");
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"));
sink.setWriter(new StringWriter<String>() );
sink.setBatchSize(1024 * 1024 * 400); // this is 400 MB,
ctrl_message_stream.broadcast().addSink(sink);
message_stream.addSink(sink);
env.execute("stream");
}
}
但是我观察到的是,它创建了4个接收器实例,并且控制消息被广播到仅2个接收器(由控制消息流创建)。 所以我所理解的是两个流应该通过相同的运营商链来做这个我们不想要的,因为数据消息会有多次转换。 我们编写了自己的接收器,如果它是控制消息,它将读取消息,然后它只会滚动文件。
示例代码:
package com.gslab.com.dataSets;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericData.Record;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class FlinkBroadcast {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);
List<String> controlMessageList = new ArrayList<String>();
controlMessageList.add("controlMessage1");
controlMessageList.add("controlMessage2");
List<String> dataMessageList = new ArrayList<String>();
dataMessageList.add("Person1");
dataMessageList.add("Person2");
dataMessageList.add("Person3");
dataMessageList.add("Person4");
DataStream<String> controlMessageStream = env.fromCollection(controlMessageList);
DataStream<String> dataMessageStream = env.fromCollection(dataMessageList);
DataStream<GenericRecord> controlMessageGenericRecordStream = controlMessageStream.map(new MapFunction<String, GenericRecord>() {
@Override
public GenericRecord map(String value) throws Exception {
Record gr = new GenericData.Record(new Schema.Parser().parse(new File("src/main/resources/controlMessageSchema.avsc")));
gr.put("TYPE", value);
return gr;
}
});
DataStream<GenericRecord> dataMessageGenericRecordStream = dataMessageStream.map(new MapFunction<String, GenericRecord>() {
@Override
public GenericRecord map(String value) throws Exception {
Record gr = new GenericData.Record(new Schema.Parser().parse(new File("src/main/resources/dataMessageSchema.avsc")));
gr.put("FIRSTNAME", value);
gr.put("LASTNAME", value+": lastname");
return gr;
}
});
//Displaying Generic records
dataMessageGenericRecordStream.map(new MapFunction<GenericRecord, GenericRecord>() {
@Override
public GenericRecord map(GenericRecord value) throws Exception {
System.out.println("data before union: "+ value);
return value;
}
});
controlMessageGenericRecordStream.broadcast().union(dataMessageGenericRecordStream).map(new MapFunction<GenericRecord, GenericRecord>() {
@Override
public GenericRecord map(GenericRecord value) throws Exception {
System.out.println("data after union: " + value);
return value;
}
});
env.execute("stream");
}
}
输出:
05/09/2016 13:02:12 Source: Collection Source(1/1) switched to FINISHED
05/09/2016 13:02:12 Source: Collection Source(1/1) switched to FINISHED
05/09/2016 13:02:13 Map(1/2) switched to FINISHED
05/09/2016 13:02:13 Map(2/2) switched to FINISHED
data after union: {"TYPE": "controlMessage1"}
data before union: {"FIRSTNAME": "Person2", "LASTNAME": "Person2: lastname"}
data after union: {"TYPE": "controlMessage1"}
data before union: {"FIRSTNAME": "Person1", "LASTNAME": "Person1: lastname"}
data after union: {"TYPE": "controlMessage2"}
data after union: {"TYPE": "controlMessage2"}
data after union: {"FIRSTNAME": "Person1", "LASTNAME": "Person1"}
data before union: {"FIRSTNAME": "Person4", "LASTNAME": "Person4: lastname"}
data before union: {"FIRSTNAME": "Person3", "LASTNAME": "Person3: lastname"}
data after union: {"FIRSTNAME": "Person2", "LASTNAME": "Person2"}
data after union: {"FIRSTNAME": "Person3", "LASTNAME": "Person3"}
05/09/2016 13:02:13 Map -> Map(2/2) switched to FINISHED
data after union: {"FIRSTNAME": "Person4", "LASTNAME": "Person4"}
05/09/2016 13:02:13 Map -> Map(1/2) switched to FINISHED
05/09/2016 13:02:13 Map(1/2) switched to FINISHED
05/09/2016 13:02:13 Map(2/2) switched to FINISHED
05/09/2016 13:02:13 Job execution switched to status FINISHED.
我们可以看到LASTNAME值不正确,它会被每条记录的FIRSTNAME值替换
答案 0 :(得分:2)
您的代码基本上使用您自己定义的接收器副本终止两个流。你想要的是这样的东西:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);
DataStream<String> ctrl_message_stream = env.socketTextStream("localhost", 8088);
DataStream<String> message_stream = env.socketTextStream("localhost", 8087);
RollingSink sink = new RollingSink<String>("/base/path");
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"));
sink.setWriter(new StringWriter<String>() );
sink.setBatchSize(1024 * 1024 * 400); // this is 400 MB,
ctrl_message_stream.broadcast().union(message_stream).addSink(sink);
env.execute("stream");