Dataflow pipeline: Python dependency installed but cannot be imported

Asked: 2019-03-21 18:32:25

Tags: dataflow

I have a simple Dataflow pipeline that runs successfully on my local machine:

import argparse
import logging
import ast
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions
from apache_beam.io.gcp.internal.clients import bigquery


def parse_args_set_logging(argv=None):
    """
    parse command line arguments
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--verbose',
                        action='store_true',
                        help='set the logging level to debug')
    parser.add_argument('--topic',
                        default=<my topic>,
                        help='GCP pubsub topic to subscribe to')

    known_args, pipeline_args = parser.parse_known_args(argv)

    # set logging level
    logging.basicConfig()
    if known_args.verbose:
        logging.getLogger().setLevel(logging.INFO)

    return known_args, pipeline_args


class formatForBigQueryDoFn(beam.DoFn):
    def record_handler(self, data):
        """
        Build a dictionary ensuring format matches BigQuery table schema
        """
        return {
            "uid": data['uid'],
            "interaction_type": data['interaction_type'],
            "interaction_asset_id": data['interaction_asset_id'],
            "interaction_value": data['interaction_value'],
            "timestamp": data['timestamp'],
        }

    def process(self, element):

        # extract data from the PubsubMessage python object and convert to python dict
        data = ast.literal_eval(element.data)
        logging.info("ELEMENT OBJECT: {}".format(data))

        # format the firestore timestamp for bigquery
        data['timestamp'] = data['timestamp']['_seconds']

        # construct the data for bigquery
        result = self.record_handler(data)
        return [result]


if __name__ == '__main__':
    known_args, pipeline_args = parse_args_set_logging()

    # create a pipeline object
    pipeline_options = GoogleCloudOptions(pipeline_args)
    p = beam.Pipeline(options=pipeline_options)

    # create a PCollection from the GCP pubsub topic
    inputCollection = p | beam.io.ReadFromPubSub(
        topic=known_args.topic,
        # id_label='id',  # unique identifier in each record to be processed
        with_attributes=True,  # output PubsubMessage objects
    )

    # chain together multiple transform methods, to create a new PCollection
    OutputCollection = inputCollection | beam.ParDo(formatForBigQueryDoFn())

    # write the resulting PCollection to BigQuery
    table_spec = <my table spec>
    table_schema = 'uid:STRING, interaction_type:STRING, interaction_asset_id:STRING, interaction_value:STRING, timestamp:TIMESTAMP'

    OutputCollection | beam.io.WriteToBigQuery(
        table_spec,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)

    # run the pipeline
    result = p.run().wait_until_finish()

I am trying to run this code with GCP Dataflow. To do that, I need to install the Python dependency AST. I tried creating a requirements.txt and using the --requirements_file argument, with no success. I am now trying setup.py. Following the docs, my setup.py looks like this:

import setuptools

setuptools.setup(
    name='pubsub_to_BQ',
    version='1.0',
    install_requires=[
        'AST'
    ],
    packages=setuptools.find_packages(),
)

I am running this on GCP with the following command:

python main.py --runner DataflowRunner \
               --setup_file ./setup.py \
               --project <myproject> \
               --temp_location <my bucket> \
               --verbose \
               --streaming \
               --job_name bigqueryinteractions

However, when the pipeline processes data, I get the following error:

File "main.py", line 47, in process NameError: global name 'ast' is not defined [while running 'generatedPtransform-54']

How can I fix this?

2 answers:

Answer 0 (score: 1):

AFAIK, if you specify setup.py from the shell command line, you should use an absolute path. You should also try Dataflow's boolean flag save_main_session, because without it your deployed template will not resolve the dependencies specified in setup.py.
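
For reference, the same save_main_session behaviour can also be enabled programmatically on the options object instead of as a command-line flag; a minimal sketch, assuming Beam's standard SetupOptions view (this is not part of the original answer):

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

pipeline_options = PipelineOptions(pipeline_args)
# pickle the __main__ session so module-level imports (like ast) are
# available inside DoFns executed on Dataflow workers
pipeline_options.view_as(SetupOptions).save_main_session = True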

Parameters that are not dynamic for the pipeline can be resolved at pipeline-construction time.

For example, with this approach you can hard-code the invariant parameters that always need to be passed, so you only have to specify the arguments that change from run to run:

known_args, pipe_args = parser.parse_known_args()
standard_pipe_arg = ['--save_main_session', '--setup_file=./setup.py', '--streaming']
pipe_opts = PipelineOptions(pipe_args + standard_pipe_arg)
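
With the invariant flags folded in at construction time, the launch command only needs the arguments that actually change between runs; roughly something like this (reusing the placeholders from the question):

python main.py --runner DataflowRunner \
               --project <myproject> \
               --temp_location <my bucket> \
               --job_name bigqueryinteractions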

Answer 1 (score: 0):

I found a workaround by using the json library instead of ast. I would still like to know what I was doing wrong.
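
For illustration only, a minimal sketch of what such a workaround could look like inside the question's DoFn, assuming the Pub/Sub payload is strict JSON (double-quoted keys, no Python-literal syntax); this is not the answerer's exact code:

import json

import apache_beam as beam


class formatForBigQueryDoFn(beam.DoFn):
    def record_handler(self, data):
        # same mapping as in the question, matched to the BigQuery schema
        return {
            "uid": data['uid'],
            "interaction_type": data['interaction_type'],
            "interaction_asset_id": data['interaction_asset_id'],
            "interaction_value": data['interaction_value'],
            "timestamp": data['timestamp'],
        }

    def process(self, element):
        # json.loads replaces ast.literal_eval; it only accepts valid JSON,
        # not arbitrary Python literals
        data = json.loads(element.data)
        data['timestamp'] = data['timestamp']['_seconds']
        return [self.record_handler(data)]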