Question

Is there a way to declare the types in a tuple dynamically?

I found a way to declare the number of columns in the tuple dynamically:

env.readCsvFile(filePath).tupleType(Tuple.getTupleClass(3))

But without any type parameters, it throws an error:

Exception in thread "main" org.apache.flink.api.common.functions.InvalidTypesException: Tuple needs to be parameterized by using generics.

I want to treat every element of the tuple as a plain String. The following works:

env.readCsvFile(filePath).types(String.class, String.class);

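This yields a Tuple2<String, String>. In my case, however, I do not know how many columns the CSV has, and I am fine with reading every column as a String. (I am aware of the 25-column limit.)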

I even tried reading the file by specifying a subtype of CsvInputFormat:

env.readFile(new TupleCsvInputFormat(filePath,TypeInformation.of(String.class), filePath);

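But it does not compile, and I am not sure how to use it for my case. I am also not sure how to extend the Tuple class to achieve the same thing (if that is even possible). TypeHint seems to require me to know the number of columns in advance.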

I am not sure about the other env.read...() methods. I tried a few, but some methods such as ignoreFirstLine() are not available on them; they only come with CsvReader.

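So, if the number of columns can be arbitrary (passed in as input) and each element of the Tuple can be read as a simple String, can someone help me figure out the best way to read the CSV?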

Answer 1

You can write your own method for reading the CSV file. Maybe something like this:

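import java.util.stream.IntStream;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.Utils;
import org.apache.flink.api.java.io.TupleCsvInputFormat;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.typeutils.TupleTypeInfo;
import org.apache.flink.core.fs.Path;

public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    int n = 3; // number of columns here
    // One String.class entry per column, so every field is read as a String
    Class[] types = IntStream.range(0, n).mapToObj(i -> String.class).toArray(Class[]::new);
    DataSet<Tuple> csv = readCsv(env, "filename.csv", types);
    csv.print();
}

// Builds the tuple type information and the input format by hand instead of going through CsvReader
private static DataSource<Tuple> readCsv(ExecutionEnvironment env, String filename, Class[] fieldTypes) {
    TupleTypeInfo<Tuple> typeInfo = TupleTypeInfo.getBasicAndBasicValueTupleTypeInfo(fieldTypes);
    TupleCsvInputFormat<Tuple> inputFormat = new TupleCsvInputFormat<>(new Path(filename), typeInfo);
    return new DataSource<>(env, inputFormat, typeInfo, Utils.getCallLocationName());
}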

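Note: this approach skips the call to the configureInputFormat method of the CsvReader class, so options that CsvReader would normally apply (such as ignoreFirstLine()) are not set. If you need them, you can set them on the input format yourself, for example with inputFormat.setSkipFirstLineAsHeader(true).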
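The `n` in `main` does not have to be hard-coded. As a rough sketch in plain Java (no Flink on the classpath), the helper below derives the column count from one line of the file and builds the matching array of `String.class` entries. The class `CsvColumnTypes` and its methods are hypothetical names made up for illustration, not part of Flink, and the counting assumes the delimiter never appears inside a quoted field.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flink: works out the field types for a CSV
// whose column count is only known at runtime.
final class CsvColumnTypes {
    // Upper bound on tuple size, taken from the 25-column limit mentioned in the question.
    static final int MAX_TUPLE_ARITY = 25;

    private CsvColumnTypes() {}

    // Counts the fields in a single line.
    // Assumption: the delimiter never occurs inside a quoted field.
    static int countColumns(String line, String delimiter) {
        // A limit of -1 keeps trailing empty fields, so "a,b," counts as 3 columns.
        return line.split(Pattern.quote(delimiter), -1).length;
    }

    // Returns an array holding n copies of String.class, one per column.
    static Class<?>[] stringTypes(int n) {
        if (n < 1 || n > MAX_TUPLE_ARITY) {
            throw new IllegalArgumentException(
                "Column count must be between 1 and " + MAX_TUPLE_ARITY + ", got " + n);
        }
        Class<?>[] types = new Class<?>[n];
        Arrays.fill(types, String.class);
        return types;
    }
}
```

The returned array could then be passed as the `fieldTypes` argument of `readCsv` above, with the line for `countColumns` taken from the first row of the file.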

Flink: declaring dynamic tuple size & types
