Apache Beam: write the values of key, value pairs to files by key

Time: 2019-11-06 07:30:59

Tags: java apache-beam

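I want to use FileIO with writeDynamic() in Apache Beam (using Java) to write the values of key, value pairs to text files in GCS, grouped by key.

So far I am reading the data from BigQuery, converting it into key, value pairs, and then trying to use FileIO with writeDynamic() to write the values for each key into one file.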

PCollection<TableRow> inputRows = p.apply(BigQueryIO.readTableRows()
    .from(tableSpec)
    .withMethod(Method.DIRECT_READ)
    .withSelectedFields(Lists.newArrayList("id", "string1", "string2", "string3", "int1")));

inputRows.apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
    .via(tableRow -> KV.of((Integer) tableRow.get("id"),(String) tableRow.get("string1"))))
    .apply(FileIO.<String, KV<String, String>>writeDynamic()
    .by(KV::getKey)
    .withDestinationCoder(StringUtf8Coder.of())
    .via(Contextful.fn(KV::getValue), TextIO.sink())
    .to("gs://bucket/output")
    .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));

I get the following error:

The method apply
  (PTransform<? super PCollection<KV<Integer,String>>,OutputT>) 
  in the type PCollection<KV<Integer,String>> 
  is not applicable for the arguments 
  (FileIO.Write<String,KV<String,String>>)

1 answer:

Answer 0 (score: 0)

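The types don't match. Note that the TableRow elements are parsed into KV<Integer, String> in MapElements (i.e. the key is an Integer). The write step, however, expects a String key, as declared in .apply(FileIO.<String, KV<String, String>>writeDynamic():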

inputRows.apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
    .via(tableRow -> KV.of((Integer) tableRow.get("id"),(String) tableRow.get("string1"))))
    .apply(FileIO.<String, KV<String, String>>writeDynamic()
    .by(KV::getKey)
    ...

To avoid having to cast the key again when using .by(KV::getKey), I would recommend casting it to String beforehand, in the MapElements step:

inputRows
    .apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
        .via(tableRow -> KV.of((String) tableRow.get("id"),(String) tableRow.get("name"))))
    .apply(FileIO.<String, KV<String, String>>writeDynamic()
        .by(KV::getKey)
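
A cast such as (String) tableRow.get("id") only succeeds when the value stored in the row is already a String at runtime. The following is a minimal sketch of a more defensive conversion, assuming the field could also come back as a numeric type. It uses a plain Map as a stand-in for TableRow (which is itself a Map<String, Object>); the class KeyConversionSketch and the helper toKey are hypothetical names, not part of Beam.

```java
import java.util.HashMap;
import java.util.Map;

class KeyConversionSketch {

    // Hypothetical helper: turns a field value of unknown runtime type into a String key.
    // String.valueOf calls toString() on non-null values and returns "null" for null.
    static String toKey(Object fieldValue) {
        return String.valueOf(fieldValue);
    }

    public static void main(String[] args) {
        // Stand-in for a TableRow; the Long id is an assumption made for illustration.
        Map<String, Object> row = new HashMap<>();
        row.put("id", 746L);
        row.put("name", "Lots Road, West Chelsea");

        // Works whether the stored value is a String, a Long, or an Integer.
        System.out.println(toKey(row.get("id")));

        try {
            // A plain cast fails here because the stored value is a Long.
            String key = (String) row.get("id");
            System.out.println(key);
        } catch (ClassCastException e) {
            System.out.println("cast failed");
        }
    }
}
```

In the pipeline, the same idea would amount to building the pair with KV.of(String.valueOf(tableRow.get("id")), ...) instead of a cast, if the cast turns out to fail for your read method.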

As an example, I tested this with the public table bigquery-public-data:london_bicycles.cycle_stations, writing each bike station to a different file:

$ cat output/file-746-00000-of-00004.txt 
Lots Road, West Chelsea

$ bq query --use_legacy_sql=false "SELECT name FROM \`bigquery-public-data.london_bicycles.cycle_stations\` WHERE id = 746"
Waiting on bqjob_<ID> ... (0s) Current status: DONE   
+-------------------------+
|          name           |
+-------------------------+
| Lots Road, West Chelsea |
+-------------------------+
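
The file name in the listing above is consistent with a layout of prefix, shard index, shard count and suffix. As a rough illustration only, the sketch below reproduces that layout in plain Java; ShardNameSketch and shardName are hypothetical names, and this is not Beam's actual naming implementation.

```java
import java.util.Locale;

class ShardNameSketch {

    // Hypothetical helper that mimics the observed layout:
    // <prefix>-<shard, 5 digits>-of-<shard count, 5 digits><suffix>
    static String shardName(String prefix, int shard, int numShards, String suffix) {
        return String.format(Locale.ROOT, "%s-%05d-of-%05d%s", prefix, shard, numShards, suffix);
    }

    public static void main(String[] args) {
        String key = "746";
        // Same prefix and suffix as passed to defaultNaming in the pipeline.
        System.out.println(shardName("file-" + key, 0, 4, ".txt"));
        // prints "file-746-00000-of-00004.txt"
    }
}
```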

Full code:

package com.dataflow.samples;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Lists;


public abstract class DynamicGCSWrites {

    public interface Options extends PipelineOptions {
        @Validation.Required
        @Description("Output Path i.e. gs://BUCKET/path/to/output/folder")
        String getOutput();
        void setOutput(String s);
    }

    public static void main(String[] args) {

        DynamicGCSWrites.Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DynamicGCSWrites.Options.class);

        Pipeline p = Pipeline.create(options);

        String output = options.getOutput();

        PCollection<TableRow> inputRows = p
            .apply(BigQueryIO.readTableRows()
                .from("bigquery-public-data:london_bicycles.cycle_stations")
                .withMethod(Method.DIRECT_READ)
                .withSelectedFields(Lists.newArrayList("id", "name")));

        inputRows
            // Convert each row into a (station id, station name) pair with String keys.
            .apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                .via(tableRow -> KV.of((String) tableRow.get("id"),(String) tableRow.get("name"))))
            // Route each pair to a destination chosen by its key and write only the value as text.
            .apply(FileIO.<String, KV<String, String>>writeDynamic()
                .by(KV::getKey)
                .withDestinationCoder(StringUtf8Coder.of())
                .via(Contextful.fn(KV::getValue), TextIO.sink())
                .to(output)
                .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));

        p.run().waitUntilFinish();
    }
}