Question

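I am trying to write values that I read from Google Analytics and Google Cloud Storage into Google BigQuery. I am writing a streaming job that writes the data as it arrives from Pub/Sub. My main concern is that I cannot perform BigQuery operations inside the map function, which makes it hard to check whether a value already exists in the BigQuery table.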

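The flow is as follows:

Google Analytics -> Pub/Sub -> processing (Map) function -> PCollection to BigQuery (io.WriteToBigQuery) operation.

The processing step runs some checks between the values in an Excel sheet and the Pub/Sub records.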

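Now I want to remove duplicate values from what gets inserted into the BigQuery table. I tried supplying an id_label (but I need to give an attribute rather than a key label). I also tried using a Map to fetch a count from BigQuery, but that does not work because bigquery.Client() is not recognized. Please find my code snippets below:

Pipeline: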

import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, SetupOptions, StandardOptions)
from apache_beam.pvalue import AsList

# pipeline_args, data and parse_pubsub are defined elsewhere in the job.
pipeline_options1 = PipelineOptions(pipeline_args)
pipeline_options1.view_as(SetupOptions).save_main_session = True
pipeline_options1.view_as(StandardOptions).streaming = True

p_bq = beam.Pipeline(options=pipeline_options1)

logging.info('Start')

# Pipeline starts. Create builds a PCollection from what we read from Cloud Storage.
test = p_bq | beam.Create(data)

# The pipeline then reads from Pub/Sub and combines each message with the
# Cloud Storage data, which is passed in as a side input.
BQ_data1 = (
    p_bq
    | 'readFromPubSub' >> beam.io.ReadFromPubSub('topic', id_label="hello")
    | beam.Map(parse_pubsub, param=AsList(test)))
BQ_data1 | beam.io.WriteToBigQuery(table='table', dataset='dataset', project='projectid')

# Run the pipeline and block until it finishes.
result_bq = p_bq.run()

result_bq.wait_until_finish()

My count function (part of the pipeline):

client = bigquery.Client()
sql_string = """Select count(*) as cout from `dataset.table` where
           id='%s' """ % (insert_record['transactionID'])
query_job = client.query(sql_string)
results = query_job.result()

But the query is not recognized.
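
For reference, here is a minimal sketch of how the count check might be restructured, assuming the failure comes from the bigquery name not being available on the worker that runs the Map. The helper names bigquery_count and drop_if_present are hypothetical, and the id column and dataset.table name are taken from the snippet above. The import happens inside the function, the client is cached per worker process, and the ID is passed as a query parameter instead of being formatted into the SQL string.

```python
_client = None  # cached per worker process; created on first use


def bigquery_count(transaction_id, table="dataset.table"):
    """Count rows whose id equals transaction_id (hypothetical helper)."""
    global _client
    # Imported inside the function so the name is resolved on the worker
    # that runs this code, not only where the pipeline is constructed.
    from google.cloud import bigquery
    if _client is None:
        _client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("tid", "STRING", transaction_id)
        ]
    )
    sql = "SELECT COUNT(*) AS cnt FROM `{}` WHERE id = @tid".format(table)
    rows = _client.query(sql, job_config=job_config).result()
    return next(iter(rows)).cnt


def drop_if_present(record, count_fn=bigquery_count):
    """Yield the record only when count_fn finds no existing row for its ID."""
    if count_fn(record["transactionID"]) == 0:
        yield record
```

It could be attached with something like BQ_data1 | beam.FlatMap(drop_if_present). This issues one query per element, so it is likely slow and costly in a streaming job, and two duplicates that arrive close together can both pass the check before either is written.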

How can I avoid writing duplicate data to BigQuery?
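
One direction I am considering is to deduplicate inside the pipeline before the write instead of querying the table. Below is a rough sketch, assuming every parsed record is a dict with a transactionID field. The names keep_first and dedup_by_transaction_id and the 60-second window are my own placeholders. It only removes duplicates that fall into the same fixed window, and which copy counts as "first" within a group is arbitrary.

```python
def keep_first(keyed_group):
    """Take a (key, records) pair from GroupByKey and return one record."""
    _, records = keyed_group
    return next(iter(records))


def dedup_by_transaction_id(records, window_seconds=60):
    """Keep one record per transactionID within each fixed window."""
    # Imported here so keep_first can be exercised without Beam installed.
    import apache_beam as beam
    from apache_beam.transforms import window

    return (
        records
        | 'Window' >> beam.WindowInto(window.FixedWindows(window_seconds))
        | 'KeyById' >> beam.Map(lambda r: (r['transactionID'], r))
        | 'GroupById' >> beam.GroupByKey()
        | 'KeepFirst' >> beam.Map(keep_first)
    )
```

The result of dedup_by_transaction_id(BQ_data1) would then be piped into beam.io.WriteToBigQuery in place of BQ_data1. Duplicates that arrive in different windows would still reach the table.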

0 answers: