Question

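I am using the readCsvFile(path) function of the Apache Flink API to read a CSV file and store it in a list variable. How does it work with multiple threads? For example, does it split the file based on some statistics? If so, which statistics? Or does it read the file line by line and then hand the lines to threads for processing?

Here is the sample code: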

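import java.util.List;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

// default parallelism is 4
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
String csvPath = "data/weather.csv";
List<Tuple2<String, Double>> csv = env.readCsvFile(csvPath)
                        .types(String.class, Double.class)
                        .collect();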

Suppose we have an 800 MB CSV file on local disk. How is the work distributed among those 4 threads?

Answer 1

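The readCsvFile() API method internally creates a data source with a CsvInputFormat, which is based on Flink's FileInputFormat. This InputFormat generates a list of so-called InputSplits. An InputSplit defines which range of a file should be scanned. The splits are then distributed to the data source tasks.

So each parallel task scans a certain region of the file and parses its content. This is very similar to how MapReduce / Hadoop does it.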
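To make the idea of "a range of the file" concrete, here is a minimal sketch that cuts a file length into contiguous byte ranges. It is an illustration only: SplitSketch and computeSplits are made-up names, not Flink classes, and the even division is an assumption. Flink's real split computation also takes things like the file system's block size into account.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Flink's FileInputFormat code.
class SplitSketch {

    /**
     * Cuts a file of fileLength bytes into at most numSplits contiguous
     * byte ranges. Each entry is {start, length}.
     * Assumption: ranges are of (almost) equal size.
     */
    static List<long[]> computeSplits(long fileLength, int numSplits) {
        List<long[]> splits = new ArrayList<>();
        // Round up so that numSplits ranges are enough to cover the file.
        long splitSize = (fileLength + numSplits - 1) / numSplits;
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last range may be shorter than the others.
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new long[] {start, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileLength = 800L * 1024 * 1024; // the 800 MB file from the question
        for (long[] s : computeSplits(fileLength, 4)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Note that the ranges are cut by byte offset without looking at the content, so a range boundary will usually fall in the middle of a line. The reader of each range has to deal with that.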

Answer 2

This works the same way as described in "How does Hadoop process records split across block boundaries?"

I extracted some code from the DelimitedInputFormat file in flink-release-1.1.3.

    // else ..
    int toRead;
    if (this.splitLength > 0) {
        // if we have more data, read that
        toRead = this.splitLength > this.readBuffer.length ? this.readBuffer.length : (int) this.splitLength;
    }
    else {
        // if we have exhausted our split, we need to complete the current record, or read one
        // more across the next split.
        // the reason is that the next split will skip over the beginning until it finds the first
        // delimiter, discarding it as an incomplete chunk of data that belongs to the last record in the
        // previous split.
        toRead = this.readBuffer.length;
        this.overLimit = true;
    }

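So if the reader has not reached a line delimiter by the end of its split, it keeps reading into the next split to complete the record. (I have not found the corresponding code yet; I will keep looking.)

Plus: the image below shows how I traced the code from readCsvFile() to DelimitedInputFormat.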
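As a rough model of the rule described in that code comment, the sketch below reads one split from an in-memory byte array. It is a simplification under my own assumptions, not Flink's implementation: BoundarySketch and readSplit are made-up names, the delimiter is fixed to '\n', and the whole file is held in memory. A reader whose split does not start at offset 0 skips everything up to and including the first delimiter, and every reader finishes (or reads) the record that reaches across its split end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Flink's DelimitedInputFormat.
class BoundarySketch {

    /** Returns the lines emitted by the reader of split [start, start + length). */
    static List<String> readSplit(byte[] data, int start, int length) {
        List<String> lines = new ArrayList<>();
        int end = Math.min(data.length, start + length);
        int pos = start;
        if (start > 0) {
            // Skip through the first delimiter: these bytes belong to a
            // record that the reader of the previous split emits.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the delimiter itself
        }
        // Emit every record that begins at or before the split end.
        // The last one may extend into the next split.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the delimiter
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "a,1\nbb,2\nccc,3\ndddd,4\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 6; // deliberately cuts lines in the middle
        for (int start = 0; start < data.length; start += splitSize) {
            System.out.println("split at " + start + ": " + readSplit(data, start, splitSize));
        }
    }
}
```

With this rule every line is emitted by exactly one reader, even though the split boundaries fall in the middle of lines.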

[Image: call path from readCsvFile() to DelimitedInputFormat]
