Question

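I am using the Dataflow SDK 2.X Java API (Apache Beam SDK) to write data into MySQL. I created the pipeline based on the Apache Beam SDK documentation. It inserts a single row at a time, whereas I need bulk inserts. I cannot find any option in the official documentation to enable a bulk insert mode.

Is it possible to set a bulk insert mode in a Dataflow pipeline? If so, please let me know what I need to change in the code below.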

 .apply(JdbcIO.<KV<Integer, String>>write()
      .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.mysql.jdbc.Driver", "jdbc:mysql://hostname:3306/mydb")
          .withUsername("username")
          .withPassword("password"))
      .withStatement("insert into Person values(?, ?)")
      .withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, String>>() {
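        public void setParameters(KV<Integer, String> element, PreparedStatement query)
            throws SQLException {
          // Bind the key and value of the current element to the two placeholders.
          query.setInt(1, element.getKey());
          query.setString(2, element.getValue());
        }
      }));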

Answer 1

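Edit 2018-01-27:

It turns out that this issue is related to the DirectRunner. If you run the same pipeline with the DataflowRunner, you should get batches of up to 1,000 records. The DirectRunner always creates bundles of size 1 after a grouping operation.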
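To illustrate why the bundle size matters, here is a minimal simulation in plain Java. It assumes a writer that buffers rows within a bundle and flushes them either when the buffer reaches 1,000 rows or when the bundle ends. The class and method names (`BundleBatchingSketch`, `writeBundles`) are made up for this sketch and are not part of the Beam API; no database is involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of a writer that batches rows per bundle. Not a Beam class. */
class BundleBatchingSketch {
    // Mirrors the 1,000-record batch limit mentioned above.
    static final int MAX_BATCH_SIZE = 1000;

    private final List<String> buffer = new ArrayList<>();
    private final List<Integer> flushedBatchSizes = new ArrayList<>();

    /** Buffers one row and flushes once the buffer is full. */
    void processElement(String row) {
        buffer.add(row);
        if (buffer.size() >= MAX_BATCH_SIZE) {
            flush();
        }
    }

    /** Flushes whatever is left when the runner ends the bundle. */
    void finishBundle() {
        flush();
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // A real writer would execute the batch here; the sketch only records its size.
        flushedBatchSizes.add(buffer.size());
        buffer.clear();
    }

    /** Feeds rowCount rows to a writer in bundles of bundleSize and returns the batch sizes. */
    static List<Integer> writeBundles(int rowCount, int bundleSize) {
        BundleBatchingSketch writer = new BundleBatchingSketch();
        int rowsInBundle = 0;
        for (int i = 0; i < rowCount; i++) {
            writer.processElement("row-" + i);
            rowsInBundle++;
            if (rowsInBundle == bundleSize) {
                writer.finishBundle();
                rowsInBundle = 0;
            }
        }
        writer.finishBundle();
        return writer.flushedBatchSizes;
    }

    public static void main(String[] args) {
        System.out.println(writeBundles(3, 1));       // prints [1, 1, 1]
        System.out.println(writeBundles(2500, 2500)); // prints [1000, 1000, 500]
    }
}
```

With bundles of size 1, every flush contains a single row, which matches the single-row inserts observed with the DirectRunner. With one large bundle, the same writer produces batches of up to 1,000 rows.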

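Original answer:

I ran into the same problem when writing to a cloud database with Apache Beam's JdbcIO. Although JdbcIO does support writing up to 1,000 records in one batch, I never actually saw it write more than one row at a time (I have to admit: this was always with the DirectRunner in a development environment).

I therefore added a feature to JdbcIO that lets you control the batch size yourself by grouping your data and writing each group as one batch. Below is an example of how to use this feature, based on Apache Beam's original WordCount example.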

p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
    // Count words in input file(s)
    .apply(new CountWords())
    // Format as text
    .apply(MapElements.via(new FormatAsTextFn()))
    // Make key-value pairs with the first letter as the key
    .apply(ParDo.of(new FirstLetterAsKey()))
    // Group the words by first letter
    .apply(GroupByKey.<String, String> create())
    // Get a PCollection of only the values, discarding the keys
    .apply(ParDo.of(new GetValues()))
    // Write the words to the database
    .apply(JdbcIO.<String> writeIterable()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(options.getJdbcDriver(), options.getURL()))
            .withStatement(INSERT_OR_UPDATE_SQL)
            .withPreparedStatementSetter(new WordCountPreparedStatementSetter()));

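The difference from JdbcIO's normal write method is the new method writeIterable(), which takes a PCollection<Iterable<RowT>> as input instead of a PCollection<RowT>. Each Iterable is written to the database as one batch.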
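The idea of writing one Iterable as one batch can be sketched with the standard JDBC batch calls (`addBatch()` and `executeBatch()`). This is only an illustration of the pattern and not the actual code from the linked JdbcIO fork; `IterableBatchSketch`, `RowSetter` and `writeAsOneBatch` are hypothetical names introduced here.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical sketch of writing one group of rows as a single JDBC batch. */
class IterableBatchSketch {

    /** Hypothetical stand-in for a prepared-statement setter: binds one row to the statement. */
    interface RowSetter<T> {
        void setParameters(T row, PreparedStatement statement) throws SQLException;
    }

    /**
     * Queues every row of the group on the statement and executes them together.
     * Returns the number of rows that were queued.
     */
    static <T> int writeAsOneBatch(
            Iterable<T> rows, PreparedStatement statement, RowSetter<T> setter)
            throws SQLException {
        int queued = 0;
        for (T row : rows) {
            setter.setParameters(row, statement);
            statement.addBatch(); // queue the row instead of executing it immediately
            queued++;
        }
        if (queued > 0) {
            statement.executeBatch(); // one batch execution for the whole group
        }
        return queued;
    }
}
```

A caller would prepare the statement once per connection, for example with `connection.prepareStatement("insert into Person values(?, ?)")`, and call `writeAsOneBatch` once per group. Whether the driver sends the batch in a single round trip depends on the driver and its settings.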

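The version of JdbcIO with this addition can be found here: https://github.com/olavloite/beam/blob/JdbcIOIterableWrite/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java

The entire example project containing the example above can be found here: https://github.com/olavloite/spanner-beam-example

(There is also a pending pull request on Apache Beam to include this in the project.)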