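I am trying to use Dataflow in GCP. The context is as follows:
- I have created a pipeline that works correctly locally. This is the test.py script (inside a DoFn I make a subprocess call that executes the script "script2.py"; that script exists locally and is also stored in a Cloud Storage bucket):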
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import SetupOptions

project = "titanium-index-200721"
bucket = "pipeline-operation-test"


class catchOutput(beam.DoFn):
    def process(self, element):
        import subprocess
        import sys
        s2_out = subprocess.check_output([sys.executable, "script2.py", "34"])
        return [s2_out]


def run():
    project = "titanium-index-200721"
    job_name = "test-setup-subprocess-newerr"
    staging_location = 'gs://pipeline-operation-test/staging'
    temp_location = 'gs://pipeline-operation-test/temp'
    setup = './setup.py'
    options = PipelineOptions()
    google_cloud_options = options.view_as(GoogleCloudOptions)
    options.view_as(SetupOptions).setup_file = "./setup.py"
    google_cloud_options.project = project
    google_cloud_options.job_name = job_name
    google_cloud_options.staging_location = staging_location
    google_cloud_options.temp_location = temp_location
    options.view_as(StandardOptions).runner = 'DataflowRunner'
    p = beam.Pipeline(options=options)
    input = 'gs://pipeline-operation-test/input2.txt'
    output = 'gs://pipeline-operation-test/OUTPUTsetup.csv'
    results = (
        p
        | 'ReadMyFile' >> beam.io.ReadFromText(input)
        | 'Split' >> beam.ParDo(catchOutput())
        | 'CreateOutput' >> beam.io.WriteToText(output)
    )
    p.run()


if __name__ == '__main__':
    run()
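I also wrote a "setup.py" script that lists all the packages my future scripts will need, so that they can run in Dataflow on GCP as well.
However, when I try to run all of this in the cloud, I run into a problem. More precisely, I get the following error when running on Dataflow: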
RuntimeError: CalledProcessError: Command '['/usr/bin/python', 'script2.py', '34']' returned non-zero exit status 2 [while running 'Split']
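I tried placing the imports (subprocess, sys) in different places, and I also tried changing the path to script2.py in the bucket, but nothing worked.
The last thing I tried, to get rid of the error, was to modify the script as follows: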
try:
    s2_out = subprocess.check_output([sys.executable, "script2.py", "34"])
except subprocess.CalledProcessError as e:
    s2_out = e.output
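But then my output is empty. Doing this only stops the pipeline from failing; script2.py still does not execute correctly.
Does anyone know how to fix this?
Thank you very much!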
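To see why the child process fails instead of only suppressing the exception, the error text it prints can be captured too. The sketch below is only a diagnostic aid, not a fix. The helper name `run_and_capture` is hypothetical. It assumes the failure message goes to stderr, which `check_output` does not capture by default, and that would explain why `e.output` is empty. CPython usually exits with status 2 when it cannot open the script file, so one plausible cause (an assumption, not confirmed here) is that script2.py is not present in the worker's working directory.

```python
import subprocess
import sys


def run_and_capture(script_path, *args):
    """Run script_path with the current interpreter.

    Returns (exit_code, output_bytes). stderr is merged into stdout so the
    child's error message is kept rather than lost.
    """
    cmd = [sys.executable, script_path] + list(args)
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return 0, out
    except subprocess.CalledProcessError as e:
        # e.output now also contains whatever the child wrote to stderr.
        return e.returncode, e.output


# Hypothetical usage inside the DoFn's process method:
#     code, text = run_and_capture("script2.py", "34")
#     return ["exit=%d output=%r" % (code, text)]
```

Returning the exit code together with the captured text as the element makes the child's message show up in the pipeline output file, where it can be read after the job finishes.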
纪莲