Flink StreamingFileSink not writing data to AWS S3

Date: 2020-01-20 02:32:37

Tags: apache-flink flink-streaming

I have a collection that represents a stream of data, and I am testing StreamingFileSink to write that stream to S3. The program runs successfully, but no data shows up under the given S3 path.

    public class S3Sink {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
            see.enableCheckpointing(100);

            List<String> input = new ArrayList<>();
            input.add("test");

            DataStream<String> inputStream = see.fromCollection(input);

            RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();

            StreamingFileSink s3Sink = StreamingFileSink
                    .forRowFormat(new Path("<S3 Path>"), new SimpleStringEncoder<>("UTF-8"))
                    .withRollingPolicy(rollingPolicy)
                    .build();

            inputStream.addSink(s3Sink);

            see.execute();
        }
    }

Checkpointing is enabled as well. Any ideas why the sink is not working as expected?

Update: Based on David's answer, I created a custom source that continuously generates random strings, and I expect checkpointing to fire after the configured interval and write the data to S3.

public class S3SinkCustom {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        see.enableCheckpointing(1000);

        DataStream<String> inputStream = see.addSource(new CustomSource());

        RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();

        StreamingFileSink s3Sink = StreamingFileSink
                .forRowFormat(new Path("s3://mybucket/data/"), new SimpleStringEncoder<>("UTF-8"))
                .build();

        //inputStream.print();

        inputStream.addSink(s3Sink);

        see.execute();
    }

    static class CustomSource extends RichSourceFunction<String> {

        private volatile boolean running = false;

        final String[] strings = {"ABC", "XYZ", "DEF"};

        @Override
        public void open(Configuration parameters) {
            running = true;
        }

        @Override
        public void run(SourceContext<String> sourceContext) throws Exception {
            // Emit one random string per second until the source is cancelled.
            Random random = new Random();
            while (running) {
                int index = random.nextInt(strings.length);
                sourceContext.collect(strings[index]);
                Thread.sleep(1000);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

}

Still, there is no data in S3, and the Flink job does not even validate whether the S3 bucket is valid or not; the job just runs without reporting any problem.

Update:

Here are the details of the custom rolling policy:

public class CustomRollingPolicy implements RollingPolicy<Object, String> {

    @Override
    public boolean shouldRollOnCheckpoint(PartFileInfo<String> partFileInfo) throws IOException {
        // Roll on a checkpoint once the part file contains any data.
        return partFileInfo.getSize() > 1;
    }

    @Override
    public boolean shouldRollOnEvent(PartFileInfo<String> partFileInfo, Object o) throws IOException {
        // Roll after every event.
        return true;
    }

    @Override
    public boolean shouldRollOnProcessingTime(PartFileInfo<String> partFileInfo, long l) throws IOException {
        // Roll on every processing-time check.
        return true;
    }
}

1 answer:

Answer 0 (score: 0)

The above issue was resolved after setting up flink-conf.yaml with the required s3a properties, such as fs.s3a.access.key and fs.s3a.secret.key.
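For illustration, the relevant flink-conf.yaml entries might look like the following (the values are placeholders, not real credentials):

```yaml
fs.s3a.access.key: YOUR_ACCESS_KEY
fs.s3a.secret.key: YOUR_SECRET_KEY
```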

We also need to let Flink know where the configuration is located:

    FileSystem.initialize(GlobalConfiguration.loadConfiguration(""));

With these changes, I was able to run the S3 sink locally, and the messages were persisted to S3.
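As a concrete sketch of applying that fix, the two s3a properties can be appended to the configuration file before launching the job. The `FLINK_CONF_DIR` path and both key values below are placeholders, not taken from the answer; adjust them to your installation and real credentials:

```shell
# Append the s3a credentials to Flink's configuration file.
# FLINK_CONF_DIR and both key values are placeholders.
FLINK_CONF_DIR="${FLINK_CONF_DIR:-./conf}"
mkdir -p "$FLINK_CONF_DIR"
cat >> "$FLINK_CONF_DIR/flink-conf.yaml" <<'EOF'
fs.s3a.access.key: YOUR_ACCESS_KEY
fs.s3a.secret.key: YOUR_SECRET_KEY
EOF
# Sanity check: both keys should now be present in the file.
grep 'fs.s3a' "$FLINK_CONF_DIR/flink-conf.yaml"
```

The same directory is what `GlobalConfiguration.loadConfiguration(...)` should point at when running from the IDE.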