I want to use FileIO with writeDynamic() in Apache Beam (Java) to write the values of key-value pairs to text files in GCS, grouped by key.

So far I am reading the data from BigQuery, converting it into key-value pairs, and then trying to use FileIO with writeDynamic() to write the values for each key into a separate file.
PCollection<TableRow> inputRows = p.apply(BigQueryIO.readTableRows()
    .from(tableSpec)
    .withMethod(Method.DIRECT_READ)
    .withSelectedFields(Lists.newArrayList("id", "string1", "string2", "string3", "int1")));

inputRows.apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
        .via(tableRow -> KV.of((Integer) tableRow.get("id"), (String) tableRow.get("string1"))))
    .apply(FileIO.<String, KV<String, String>>writeDynamic()
        .by(KV::getKey)
        .withDestinationCoder(StringUtf8Coder.of())
        .via(Contextful.fn(KV::getValue), TextIO.sink())
        .to("gs://bucket/output")
        .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));
I get the following error:
The method apply(PTransform<? super PCollection<KV<Integer,String>>,OutputT>) in the type PCollection<KV<Integer,String>> is not applicable for the arguments (FileIO.Write<String,KV<String,String>>)
Answer (score 0):
That's a type mismatch. Note that the TableRow elements are parsed into KV<Integer, String> in MapElements (i.e. the key is an Integer), whereas the write step expects String keys, as declared in .apply(FileIO.<String, KV<String, String>>writeDynamic()):
inputRows.apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
        .via(tableRow -> KV.of((Integer) tableRow.get("id"), (String) tableRow.get("string1"))))
    .apply(FileIO.<String, KV<String, String>>writeDynamic()
        .by(KV::getKey)
        ...
To avoid having to cast the key again in .by(KV::getKey), I suggest casting it to String beforehand:
inputRows
    .apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
        .via(tableRow -> KV.of((String) tableRow.get("id"), (String) tableRow.get("name"))))
    .apply(FileIO.<String, KV<String, String>>writeDynamic()
        .by(KV::getKey)
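Alternatively, if you would rather keep the key as an Integer, you can parameterize writeDynamic() with an Integer destination type and use a matching destination coder instead of casting. This is only a rough, untested sketch of that option (it reuses the id/string1 fields and the Integer cast from the question, and needs an import for org.apache.beam.sdk.coders.VarIntCoder):

// Sketch only: keep Integer keys by making the destination type Integer
// and using VarIntCoder as the destination coder.
inputRows
    .apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
        .via(tableRow -> KV.of((Integer) tableRow.get("id"), (String) tableRow.get("string1"))))
    .apply(FileIO.<Integer, KV<Integer, String>>writeDynamic()
        .by(KV::getKey)
        .withDestinationCoder(VarIntCoder.of())   // coder must match the Integer destination type
        .via(Contextful.fn(KV::getValue), TextIO.sink())
        .to("gs://bucket/output")
        .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));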
As an example, I tested the String-key version with the public table bigquery-public-data:london_bicycles.cycle_stations, writing the name of each bike station to a different file:
$ cat output/file-746-00000-of-00004.txt
Lots Road, West Chelsea
$ bq query --use_legacy_sql=false "SELECT name FROM \`bigquery-public-data.london_bicycles.cycle_stations\` WHERE id = 746"
Waiting on bqjob_<ID> ... (0s) Current status: DONE
+-------------------------+
|          name           |
+-------------------------+
| Lots Road, West Chelsea |
+-------------------------+
Full code:
package com.dataflow.samples;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Lists;

public abstract class DynamicGCSWrites {

    public interface Options extends PipelineOptions {
        @Validation.Required
        @Description("Output Path i.e. gs://BUCKET/path/to/output/folder")
        String getOutput();
        void setOutput(String s);
    }

    public static void main(String[] args) {

        DynamicGCSWrites.Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DynamicGCSWrites.Options.class);

        Pipeline p = Pipeline.create(options);

        String output = options.getOutput();

        PCollection<TableRow> inputRows = p
            .apply(BigQueryIO.readTableRows()
                .from("bigquery-public-data:london_bicycles.cycle_stations")
                .withMethod(Method.DIRECT_READ)
                .withSelectedFields(Lists.newArrayList("id", "name")));

        inputRows
            .apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                .via(tableRow -> KV.of((String) tableRow.get("id"), (String) tableRow.get("name"))))
            .apply(FileIO.<String, KV<String, String>>writeDynamic()
                .by(KV::getKey)
                .withDestinationCoder(StringUtf8Coder.of())
                .via(Contextful.fn(KV::getValue), TextIO.sink())
                .to(output)
                .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));

        p.run().waitUntilFinish();
    }
}
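If defaultNaming() does not give you enough control over the output file names, FileIO.Write also accepts a custom FileIO.Write.FileNaming per destination key. The fragment below is only a sketch of that option (the station- prefix and the format string are made up for illustration; it is not part of the code tested above):

// Sketch only: FileNaming is a single-method interface, so a lambda works.
// It receives the window, pane, shard count/index and compression, so the
// file name can encode the shard as well as the destination key.
.withNaming(key -> (window, pane, numShards, shardIndex, compression) ->
    String.format("station-%s-%05d-of-%05d.txt", key, shardIndex, numShards))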