我在 Python 中创建了一个流式数据流管道,只是想说明我下面的代码是否按我的预期运行。这就是我打算做的:
这是 Python 中的代码片段
options = PipelineOptions(
subnetwork=SUBNETWORK,
service_account_email=SERVICE_ACCOUNT_EMAIL,
use_public_ips=False,
streaming=True,
project=project,
region=REGION,
staging_location=STAGING_LOCATION,
temp_location=TEMP_LOCATION,
job_name=f"pub-sub-to-big-query-xxx-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
)
p = beam.Pipeline(DataflowRunner(), options=options)
pubsub = (
p
| "Read Topic" >> ReadFromPubSub(topic=INPUT_TOPIC)
| "To Dict" >> Map(json.loads)
| "Write To BigQuery" >> WriteToBigQuery(table=TABLE, schema=schema, method='FILE_LOADS',
triggering_frequency=60, max_files_per_bundle=1,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=BigQueryDisposition.WRITE_APPEND))
我可以知道上面的代码是否在做我想要做的事情吗?从 Pub/Sub 流式传输,每 60 秒,它会批量插入 BigQuery。我特意将 max_files_per_bundle 设置为 1 以防止创建 1 个以上的分片,以便每分钟只加载 1 个文件,但不确定我是否做得对。 Java 版本有 withNumFileShards 选项,但我在 Python 中找不到等效项。我参考以下文档: https://beam.apache.org/releases/pydoc/2.31.0/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.WriteToBigQuery
只是想知道我是否应该使用窗口来实现我打算做的事情?
options = PipelineOptions(
subnetwork=SUBNETWORK,
service_account_email=SERVICE_ACCOUNT_EMAIL,
use_public_ips=False,
streaming=True,
project=project,
region=REGION,
staging_location=STAGING_LOCATION,
temp_location=TEMP_LOCATION,
job_name=f"pub-sub-to-big-query-xxx-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
)
p = beam.Pipeline(DataflowRunner(), options=options)
pubsub = (
p
| "Read Topic" >> ReadFromPubSub(topic=INPUT_TOPIC)
| "To Dict" >> Map(json.loads)
| 'Window' >> beam.WindowInto(window.FixedWindows(60), trigger=AfterProcessingTime(60),
accumulation_mode=AccumulationMode.DISCARDING)
| "Write To BigQuery" >> WriteToBigQuery(table=TABLE, schema=schema, method='FILE_LOADS',
triggering_frequency=60, max_files_per_bundle=1,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=BigQueryDisposition.WRITE_APPEND))
没有第二种方法中的窗口,第一种方法是否足够好?我现在正在使用第一种方法,但我不确定是否每分钟都是 从多个文件进行多次加载,还是实际上将所有发布/订阅消息合并为 1 并进行一次批量加载?< /p>
谢谢!