I have a simple Dataflow pipeline that runs successfully on my local machine:
import argparse
import logging
import ast
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions
from apache_beam.io.gcp.internal.clients import bigquery
def parse_args_set_logging(argv=None):
    """
    parse command line arguments
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--verbose',
                        action='store_true',
                        help='set the logging level to debug')
    parser.add_argument('--topic',
                        default=<my topic>,
                        help='GCP pubsub topic to subscribe to')
    known_args, pipeline_args = parser.parse_known_args(argv)

    # set logging level
    logging.basicConfig()
    if known_args.verbose:
        logging.getLogger().setLevel(logging.INFO)

    return known_args, pipeline_args
class formatForBigQueryDoFn(beam.DoFn):
    def record_handler(self, data):
        """
        Build a dictionary ensuring format matches BigQuery table schema
        """
        return {
            "uid": data['uid'],
            "interaction_type": data['interaction_type'],
            "interaction_asset_id": data['interaction_asset_id'],
            "interaction_value": data['interaction_value'],
            "timestamp": data['timestamp'],
        }

    def process(self, element):
        # extract data from the PubsubMessage python object and convert to python dict
        data = ast.literal_eval(element.data)
        logging.info("ELEMENT OBJECT: {}".format(data))

        # format the firestore timestamp for bigquery
        data['timestamp'] = data['timestamp']['_seconds']

        # construct the data for bigquery
        result = self.record_handler(data)
        return [result]
if __name__ == '__main__':
    known_args, pipeline_args = parse_args_set_logging()

    # create a pipeline object
    pipeline_options = GoogleCloudOptions(pipeline_args)
    p = beam.Pipeline(options=pipeline_options)

    # create a PCollection from the GCP pubsub topic
    inputCollection = p | beam.io.ReadFromPubSub(
        topic=known_args.topic,
        # id_label='id',        # unique identifier in each record to be processed
        with_attributes=True,   # output PubsubMessage objects
    )

    # chain together multiple transform methods, to create a new PCollection
    OutputCollection = inputCollection | beam.ParDo(formatForBigQueryDoFn())

    # write the resulting PCollection to BigQuery
    table_spec = <my table spec>
    table_schema = 'uid:STRING, interaction_type:STRING, interaction_asset_id:STRING, interaction_value:STRING, timestamp:TIMESTAMP'

    OutputCollection | beam.io.WriteToBigQuery(
        table_spec,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)

    # run the pipeline
    result = p.run().wait_until_finish()
I am trying to run this code on GCP Dataflow. To do that, I need to install the Python dependency AST. I tried creating a requirements.txt and passing it with the --requirements_file argument, without success. I am now trying setup.py. Following the docs, my setup.py looks like this:
import setuptools

setuptools.setup(
    name='pubsub_to_BQ',
    version='1.0',
    install_requires=[
        'AST'
    ],
    packages=setuptools.find_packages(),
)
I am launching it on GCP with the following command:
python main.py --runner DataflowRunner \
--setup_file ./setup.py \
--project <myproject> \
--temp_location <my bucket> \
--verbose \
--streaming \
--job_name bigqueryinteractions
However, when the pipeline processes data, I get the following error:
File "main.py", line 47, in process
NameError: global name 'ast' is not defined [while running 'generatedPtransform-54']
How can I solve this?
Answer 0 (score: 1)
AFAIK, if you specify setup.py from the shell command line, you should use an absolute path. With Dataflow you should also try the boolean flag save_main_session, because without it your deployed template will not be able to resolve the dependencies specified in setup.py.
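For illustration, the same two options can also be set programmatically when building the pipeline options. This is only a minimal sketch, assuming pipeline_args comes from parse_known_args as in the question, and using a hypothetical absolute path:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# sketch: enable save_main_session and point at setup.py via an absolute path
pipeline_options = PipelineOptions(pipeline_args)
setup_options = pipeline_options.view_as(SetupOptions)
setup_options.save_main_session = True  # pickle the main session so module-level imports reach the workers
setup_options.setup_file = '/absolute/path/to/setup.py'  # hypothetical absolute path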
Parameters that are not dynamic for the pipeline can be resolved at pipeline construction time. With this approach you can, for example, hard-code the invariant arguments that always have to be passed, so you only need to supply the arguments that change from one execution to the next:
known_args, pipe_args = parser.parse_known_args()
standard_pipe_arg = ['--save_main_session', '--setup_file=./setup.py', '--streaming']
pipe_opts = PipelineOptions(pipe_args + standard_pipe_arg)
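A usage sketch of how the merged options would then feed into the pipeline, assuming the rest of the pipeline stays as in the question:

# build and run the pipeline with the merged options from above
p = beam.Pipeline(options=pipe_opts)
# ... apply the same ReadFromPubSub / ParDo / WriteToBigQuery transforms ...
result = p.run().wait_until_finish()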
Answer 1 (score: 0)
I found a workaround using the json library instead of ast. I would still like to know what I was doing wrong.
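For reference, a minimal sketch of what that workaround might look like inside the DoFn, assuming the Pub/Sub payload is UTF-8 encoded JSON and reusing the field names from the question:

import json

import apache_beam as beam


class formatForBigQueryDoFn(beam.DoFn):
    def process(self, element):
        # decode the PubsubMessage payload with json instead of ast.literal_eval
        data = json.loads(element.data.decode('utf-8'))

        # flatten the Firestore timestamp for BigQuery, as in the original pipeline
        data['timestamp'] = data['timestamp']['_seconds']

        yield {
            "uid": data['uid'],
            "interaction_type": data['interaction_type'],
            "interaction_asset_id": data['interaction_asset_id'],
            "interaction_value": data['interaction_value'],
            "timestamp": data['timestamp'],
        }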