Question

I am implementing a custom input format for Apache Flink. As a first step I created a dummy input format that returns 3 rows.

public class ElasticsearchInputFormat extends GenericInputFormat<Row> {
    @Override
    public void configure(Configuration parameters) {
        System.out.println("configuring");
    }

    @Override
    public BaseStatistics getStatistics(BaseStatistics cachedStatistics) throws IOException {
        return cachedStatistics;
    }

    @Override
    public void open(GenericInputSplit split) throws IOException {
        System.out.println("opening: " + split);
        super.open(split);
    }

    @Override
    public void close() throws IOException {
        System.out.println("closing");
        super.close();
    }

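    // Number of rows emitted so far by this instance.
    private int a = 0;

    @Override
    public boolean reachedEnd() throws IOException {
        // Pure check: the counter is advanced in nextRecord(), not here.
        return a >= 3;
    }

    @Override
    public Row nextRecord(Row reuse) throws IOException {
        a++;
        Row r = new Row(2);
        r.setField(0, "osman");
        r.setField(1, "wow");
        return r;
    }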
}

My sample code is as follows:

final ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
env.setParallelism(8);

DataSource<Row> input = env.createInput(new ElasticsearchInputFormat());

input.print();

However, even though the parallelism is set to 8, it prints:

configuring
opening: GenericSplit (0/1)
closing
osman,wow
osman,wow
osman,wow

Why is it not parallelized? I would like to have multiple splits so that they can be consumed in parallel by downstream operators.

Answer 1

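createCollectionsEnvironment() returns a special environment with an implicit parallelism of 1, so the setParallelism(8) call has no effect and only a single split is created. From the Javadocs:

Creates a {@link CollectionEnvironment} that uses Java Collections underneath. This will execute in a single thread in the current JVM. It is very fast but will fail if the data does not fit into memory. parallelism will always be 1. This is useful during implementation and for debugging.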
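One way to get several splits while still running locally is to swap the collection-based environment for a regular local one. The following is a minimal sketch, not a definitive implementation: it assumes a Flink version that still ships the DataSet API (with flink-java and flink-clients on the classpath), and the names ParallelInputExample, DummyInputFormat and createEnv are made up for illustration. The dummy format resets its counter in open() because a single subtask may be handed more than one split.

```java
import java.io.IOException;

import org.apache.flink.api.common.io.GenericInputFormat;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.io.GenericInputSplit;
import org.apache.flink.types.Row;

public class ParallelInputExample {

    /** Hypothetical stand-in for the format in the question: three rows per split. */
    public static class DummyInputFormat extends GenericInputFormat<Row> {
        private static final long serialVersionUID = 1L;

        // Rows emitted for the split that is currently open.
        private int emitted;

        @Override
        public void open(GenericInputSplit split) throws IOException {
            super.open(split);
            emitted = 0; // reset per split; one subtask may process several splits
            System.out.println("opening: " + split);
        }

        @Override
        public boolean reachedEnd() {
            return emitted >= 3;
        }

        @Override
        public Row nextRecord(Row reuse) {
            emitted++;
            return Row.of("osman", "wow");
        }
    }

    /** Local multi-threaded environment instead of createCollectionsEnvironment(). */
    static ExecutionEnvironment createEnv(int parallelism) {
        return ExecutionEnvironment.createLocalEnvironment(parallelism);
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = createEnv(8);
        // The source now runs with parallelism 8, so the format is asked for 8 splits.
        env.createInput(new DummyInputFormat()).print();
    }
}
```

With this environment you should see one "opening" line per split rather than a single GenericSplit (0/1), and each split contributes its own three rows. ExecutionEnvironment.getExecutionEnvironment() is the other usual choice: it picks a local environment when run from the IDE and the cluster environment when the job is submitted to a cluster.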
