Question

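I have a local Python package that I want to use in an Apache Beam pipeline running on the Dataflow runner. I tried to follow the instructions in the documentation: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/ (the "Local or non-PyPI Dependencies" section), but with no success.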

My package has the following structure:

my_common
├── __init__.py
└── shared
    ├── __init__.py
    └── something.py

The something.py file contains:

def hello_world():
    return "Hello"

The package is built with the python setup.py sdist command.
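Before handing the archive to a runner, it can help to confirm that the sdist really contains the shared subpackage. The following is a minimal sketch using only the standard library; the helper name modules_in_sdist is hypothetical, and it assumes the usual sdist layout where every member sits under a single top-level folder such as my_common-1.0/.

```python
import tarfile


def modules_in_sdist(sdist_path):
    """Return the sorted .py paths inside an sdist, relative to its top-level folder."""
    with tarfile.open(sdist_path, "r:gz") as archive:
        names = archive.getnames()

    modules = []
    for name in names:
        # Members look like "my_common-1.0/my_common/shared/something.py";
        # drop the leading "<name>-<version>/" component.
        _, _, relative = name.partition("/")
        if relative.endswith(".py"):
            modules.append(relative)
    return sorted(modules)
```

If my_common/shared/something.py is missing from the returned list, the import failure comes from how the archive was built rather than from the runner.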

Now, my Apache Beam pipeline is configured as follows:

pipeline_parameters = [
    '--project', project_id,
    '--staging_location', staging_location,
    '--temp_location', temp_location,
    '--max_num_workers', '1',
    "--extra_package", "/absolute/path/to/my/package/my_common-1.0.tar.gz"
]


p = beam.Pipeline("DataFlowRunner", argv=pipeline_parameters)
# rest of the pipeline definition

One of the pipeline's map functions contains the following code, which uses my module:

from my_common.shared import something
logging.info(something.hello_world())

Whenever I schedule this pipeline on Dataflow, I get the following error:

ImportError: No module named shared

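Interestingly, when I install this package (from the .tar.gz file) in another environment, I can import it and run its functions without any problem. It seems to me that Dataflow does not install the package before running the pipeline.

What is the proper way of managing and deploying local Python dependencies to Google Dataflow?

// Update: the solution described in https://stackoverflow.com/a/46605344/1955346 is not sufficient for my use case, because my local package has to live in a completely different folder and my pipeline's setup.py already has some content (so I cannot use the setup.py of the external package, as suggested there).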

Answer 1

Instead of providing the package through --extra_package, provide it using --setup_file.

Define your setup file with setuptools; it would look like the following:

from setuptools import setup

setup(
    name="dataflow_pipeline_dependencies",
    version="1.0.0",
    author="Marcin Zablocki",
    author_email="youemail@domain.com",
    description=("Custom python utils needed for dataflow cloud runner"),
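    # setuptools does not pick up subpackages on its own,
    # so my_common.shared has to be listed as well
    packages=[
        'my_common',
        'my_common.shared'
        ]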
)
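The packages argument only installs the packages it names, so a subpackage such as my_common.shared has to be listed too (setuptools also ships find_packages() for this purpose). As an illustration of what needs to appear in that list, here is a standard-library sketch; discover_packages is a hypothetical helper, not part of setuptools or Beam, and it assumes every package directory contains an __init__.py.

```python
import os


def discover_packages(root):
    """List dotted names of every package (directory with __init__.py) under root."""
    packages = []
    for current, dirs, files in os.walk(root):
        dirs.sort()  # visit subdirectories in a stable order
        if current == root:
            continue  # the project folder itself is not a package
        if "__init__.py" not in files:
            dirs[:] = []  # not a package, so do not descend into it
            continue
        relative = os.path.relpath(current, root)
        packages.append(relative.replace(os.sep, "."))
    return packages
```

For the layout in the question, calling it on the folder that holds my_common yields both my_common and my_common.shared, which is the list setup() needs.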

and pass it using the --setup_file parameter, as follows:

pipeline_parameters = [
    '--project', project_id,
    '--staging_location', staging_location,
    '--temp_location', temp_location,
    '--max_num_workers', '1',
    "--setup_file", "/absolute/path/to/your/package/setup.py"
]


p = beam.Pipeline("DataFlowRunner", argv=pipeline_parameters)
# rest of the pipeline definition

where /absolute/path/to/your/package is the directory that holds setup.py and the my_common package. Beam expects --setup_file to name the setup.py file itself rather than a directory.
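Two details of the argument list are easy to get wrong: every argv entry should be a string, and Beam expects the --setup_file value to be a file named setup.py. A small sketch that guards against both, assuming a hypothetical helper build_pipeline_args (not part of Beam) and the same option names used above:

```python
import os


def build_pipeline_args(project_id, staging_location, temp_location,
                        setup_file, max_num_workers=1):
    """Build an argv list for beam.Pipeline with string-only values."""
    # Fail early if the path does not name a setup.py file.
    if os.path.basename(setup_file) != "setup.py":
        raise ValueError(
            "--setup_file must point at a file named setup.py, got %r" % setup_file
        )
    return [
        "--project", project_id,
        "--staging_location", staging_location,
        "--temp_location", temp_location,
        # argv entries are parsed as text, so convert numbers explicitly
        "--max_num_workers", str(max_num_workers),
        "--setup_file", setup_file,
    ]
```

The returned list can then be passed as argv when constructing the pipeline, in place of the hand-written pipeline_parameters list.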

Apache Beam local Python dependencies
