Question

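My Apache Beam pipeline implements custom Transforms and ParDo's in Python modules, which in turn import other modules that I have written. On the local runner this works fine, because all the files are available on the same path. With the DataflowRunner, the pipeline fails with a module import error.

How do I make my custom modules available to all of the Dataflow workers? Please advise.

Here is an example of the error: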

ImportError: No module named DataAggregation

    at find_class (/usr/lib/python2.7/pickle.py:1130)
    at find_class (/usr/local/lib/python2.7/dist-packages/dill/dill.py:423)
    at load_global (/usr/lib/python2.7/pickle.py:1096)
    at load (/usr/lib/python2.7/pickle.py:864)
    at load (/usr/local/lib/python2.7/dist-packages/dill/dill.py:266)
    at loads (/usr/local/lib/python2.7/dist-packages/dill/dill.py:277)
    at loads (/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py:232)
    at apache_beam.runners.worker.operations.PGBKCVOperation.__init__ (operations.py:508)
    at apache_beam.runners.worker.operations.create_pgbk_op (operations.py:452)
    at apache_beam.runners.worker.operations.create_operation (operations.py:613)
    at create_operation (/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py:104)
    at execute (/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py:130)
    at do_work (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:642)

Answer 1

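The problem is probably that you have not grouped your files into a package. The Beam documentation has a section on this.

Multiple File Dependencies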

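Often, your pipeline code spans multiple files. To run your project remotely, you must group these files as a Python package and specify the package when you run your pipeline. When the remote workers start, they will install your package. To group your files as a Python package and make it available remotely, perform the following steps:
Create a setup.py file for your project. The following is a very basic setup.py file.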
import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    version='PACKAGE-VERSION',
    install_requires=[],
    packages=setuptools.find_packages(),
)
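Structure your project so that the root directory contains the setup.py file, the main workflow file, and a directory with the rest of the files.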
root_dir/
    setup.py
    main.py
    other_files_dir/
See Juliaset for an example that follows this required project structure.
Run your pipeline with the following command-line option:
--setup_file /path/to/setup.py
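Note: If you created a requirements.txt file and your project spans multiple files, you can get rid of the requirements.txt file and instead add all of the packages it lists to the install_requires field of the setup call (in the setup.py file from the first step).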
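
To illustrate the note above, here is a minimal sketch of collecting the entries of a requirements.txt file so they can be copied into install_requires. The helper name read_requirements is hypothetical and not part of Beam or setuptools, and the parsing assumes a simple file with one requirement per line (pip options such as "-r other.txt" are skipped rather than followed).

```python
from pathlib import Path


def read_requirements(path):
    """Return the requirement specifiers listed in a simple requirements.txt file."""
    requirements = []
    for raw_line in Path(path).read_text().splitlines():
        # Drop trailing comments (introduced by " #") and surrounding whitespace.
        line = raw_line.split(" #", 1)[0].strip()
        # Skip blank lines, full-line comments, and pip options such as "-r other.txt".
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        requirements.append(line)
    return requirements
```

You could run this once, paste the resulting list into install_requires in setup.py, and then delete requirements.txt, which matches what the note suggests.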

Answer 2

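I ran into the same problem, and unfortunately the documentation is not as detailed as it needs to be. The catch is that both root_dir and other_files_dir must contain an __init__.py file. When a directory contains an __init__.py file (even an empty one), Python treats that directory as a package, which is what we want here. So your final folder structure should look like this: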

root_dir/
    __init__.py
    setup.py
    main.py
    other_files_dir/
        __init__.py
        module_1.py
        module_2.py

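You will find that setuptools builds an .egg-info folder that describes your package, including all of its pip dependencies. It also contains a top_level.txt file, which holds the name of the directory containing your modules (i.e. other_files_dir).

You can then simply import the modules in main.py like this: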

from other_files_dir import module_1
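
As a quick check before submitting to Dataflow, something like the following can flag directories that hold Python files but lack an __init__.py. This is only a sketch using the standard library. The helper name find_dirs_missing_init and the list of skipped directory names are my own assumptions, not anything Beam provides.

```python
import os

# Directory names that are build output or caches rather than source packages (assumed list).
SKIPPED_DIRS = {"build", "dist", "__pycache__"}


def find_dirs_missing_init(project_root):
    """Return directories under project_root that hold .py files but no __init__.py."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(project_root):
        # Prune hidden directories, build artifacts, and .egg-info folders in place.
        dirnames[:] = [
            name for name in dirnames
            if not name.startswith(".")
            and not name.endswith(".egg-info")
            and name not in SKIPPED_DIRS
        ]
        has_python_files = any(name.endswith(".py") for name in filenames)
        if has_python_files and "__init__.py" not in filenames:
            # Report paths relative to the project root ("." means the root itself).
            missing.append(os.path.relpath(dirpath, project_root))
    return sorted(missing)
```

Calling find_dirs_missing_init("root_dir") on the layout above should return an empty list. Any path it does return is a directory that, following this answer, needs an __init__.py.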

Google Dataflow - Unable to import custom Python modules
