BigQuery: Split a table based on a column

Time: 2018-11-01 20:25:10

Tags: google-bigquery gcloud

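Short question: I want to split a BQ table into multiple smaller tables based on the distinct values of one column. So if the column country has 10 distinct values, the table should be split into 10 separate tables, each holding the data for its own country. Ideally this would be done from within a BQ query (using INSERT, MERGE, etc.).

What I am doing right now is exporting the data to gstorage -> local storage -> splitting it locally, and then pushing the pieces into tables (which is a very time-consuming process).

Thanks.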

2 answers:

Answer 0: (score: 1)

If the data all has the same schema, just leave it in one table and use the clustering feature: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#creating_a_clustered_table

#standardSQL
 CREATE TABLE mydataset.myclusteredtable
 PARTITION BY dateCol
 CLUSTER BY country
 OPTIONS (
   description="a table clustered by country"
 ) AS (
   SELECT ....
 )

https://cloud.google.com/bigquery/docs/clustered-tables

This feature is in beta.
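If separate per-country tables are still required, one query-only route is to run one CREATE TABLE ... AS SELECT statement per distinct value, with the values taken from a SELECT DISTINCT country query first. The following is a minimal sketch in Java that only builds the statement text. The class name SplitTableSql, the source table mydataset.mytable and the target naming scheme (source table name plus a suffix) are assumptions for illustration, not something from the question. Running the generated statements (for example through the bq CLI or a client library) is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitTableSql {

    // Builds one CREATE TABLE ... AS SELECT statement for a single column value.
    // Dataset, table and column names are inserted as-is, so they must come
    // from trusted input.
    public static String ctasFor(String dataset, String table, String column, String value) {
        // Keep only letters, digits and underscores so the value can be used in
        // a table name. Distinct values may collide after this step
        // (for example "a-b" and "a_b"); a real job would need to handle that.
        String suffix = value.replaceAll("[^A-Za-z0-9_]", "_");
        // Escape backslashes and single quotes for the SQL string literal.
        String literal = value.replace("\\", "\\\\").replace("'", "\\'");
        return "CREATE TABLE " + dataset + "." + table + "_" + suffix
                + " AS SELECT * FROM " + dataset + "." + table
                + " WHERE " + column + " = '" + literal + "'";
    }

    // Builds one statement per value, in the order the values are given.
    public static List<String> ctasForAll(String dataset, String table, String column,
                                          List<String> values) {
        List<String> statements = new ArrayList<>();
        for (String value : values) {
            statements.add(ctasFor(dataset, table, column, value));
        }
        return statements;
    }

    public static void main(String[] args) {
        // Hypothetical values; in practice they would come from
        // SELECT DISTINCT country FROM mydataset.mytable.
        List<String> countries = Arrays.asList("US", "DE");
        for (String sql : ctasForAll("mydataset", "mytable", "country", countries)) {
            System.out.println(sql + ";");
        }
    }
}
```

Each generated statement scans the source table once, so with many distinct values the clustered single-table layout above is likely to be cheaper.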

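Answer 1: (score: 1)

You can use Dataflow for this. This answer gives an example of a pipeline that queries a BigQuery table, splits the rows based on a column, and writes them to different PubSub topics (which could instead be different BigQuery tables):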

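// Note: this example uses the Dataflow SDK 1.x API (BigQueryIO.Read.named,
// ParDo.named, sideOutput), which predates Apache Beam 2.x.
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());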

PCollection<TableRow> weatherData = p.apply(
        BigQueryIO.Read.named("ReadWeatherStations").from("clouddataflow-readonly:samples.weather_stations"));

final TupleTag<String> readings2010 = new TupleTag<String>() {
};
final TupleTag<String> readings2000plus = new TupleTag<String>() {
};
final TupleTag<String> readingsOld = new TupleTag<String>() {
};

PCollectionTuple collectionTuple = weatherData.apply(ParDo.named("tablerow2string")
        .withOutputTags(readings2010, TupleTagList.of(readings2000plus).and(readingsOld))
        .of(new DoFn<TableRow, String>() {
            @Override
            public void processElement(DoFn<TableRow, String>.ProcessContext c) throws Exception {

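                // Route each row by the value of its third field (index 2),
                // which this example treats as a year.
                if (c.element().getF().get(2).getV().equals("2010")) {
                    // Exactly 2010: main output (readings2010).
                    c.output(c.element().toString());
                } else if (Integer.parseInt(c.element().getF().get(2).getV().toString()) > 2000) {
                    // Any other year after 2000: first side output.
                    c.sideOutput(readings2000plus, c.element().toString());
                } else {
                    // 2000 and earlier: second side output.
                    c.sideOutput(readingsOld, c.element().toString());
                }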

            }
        }));
collectionTuple.get(readings2010)
        .apply(PubsubIO.Write.named("WriteToPubsub1").topic("projects/fh-dataflow/topics/bq2pubsub-topic1"));
collectionTuple.get(readings2000plus)
        .apply(PubsubIO.Write.named("WriteToPubsub2").topic("projects/fh-dataflow/topics/bq2pubsub-topic2"));
collectionTuple.get(readingsOld)
        .apply(PubsubIO.Write.named("WriteToPubsub3").topic("projects/fh-dataflow/topics/bq2pubsub-topic3"));

p.run();
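The branching inside processElement is the part that would change for the original question, where the routing key would be country rather than a year. To make the boundaries of the example explicit, the same decision can be pulled out into a plain function and checked without Dataflow. This is only a sketch: the class name ReadingRouter and the idea of returning the tag name as a string are assumptions for illustration and are not part of the pipeline above.

```java
public class ReadingRouter {

    // Mirrors the if / else-if / else chain in processElement:
    // exactly "2010" goes to the main output, any other year above 2000 goes
    // to the first side output, and everything else goes to the second one.
    public static String route(String year) {
        if (year.equals("2010")) {
            return "readings2010";
        } else if (Integer.parseInt(year) > 2000) {
            return "readings2000plus";
        } else {
            return "readingsOld";
        }
    }

    public static void main(String[] args) {
        // A few boundary values to show where each year ends up.
        for (String year : new String[] {"1999", "2000", "2005", "2010", "2011"}) {
            System.out.println(year + " -> " + route(year));
        }
    }
}
```

Note that 2000 itself falls into readingsOld and 2011 into readings2000plus, because the middle branch uses a strict "greater than 2000" test and only the exact value 2010 reaches the main output.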