Question

I am trying to run a DataFlow pipeline remotely, and the pipeline needs to use a pickle file. Locally, I can load the file with the code below.

with open(known_args.file_path, 'rb') as fp:
    file = pickle.load(fp)

However, this does not work when the path points to Cloud Storage (gs://...):

IOError: [Errno 2] No such file or directory: 'gs://.../.pkl'

I more or less understand why it does not work, but I cannot find the proper way to do it.

Answer 1

If you have pickle files in your GCS bucket, you can load them as blobs and then process them further as in your code (using pickle.load()).
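
A minimal sketch of the blob approach, assuming the apache-beam[gcp] extra is installed on the workers. The function name load_pickle_from_gcs and the gcs_client parameter are hypothetical and exist only for this illustration; GcsIO is Beam's GCS client, and its open() method returns a file-like object for the blob.

```python
import pickle


def load_pickle_from_gcs(gcs_path, gcs_client=None):
    """Read a pickled object from a gs:// path as a blob.

    gcs_client is a hypothetical injection point for this sketch;
    when omitted, Beam's GcsIO client is created.
    """
    if gcs_client is None:
        # Imported lazily; needs the apache-beam[gcp] extra.
        from apache_beam.io.gcp import gcsio
        gcs_client = gcsio.GcsIO()
    # Read the whole blob into memory, then unpickle the bytes.
    with gcs_client.open(gcs_path) as blob:
        return pickle.loads(blob.read())
```

A function like this could be called from a DoFn's process method, so the read happens on the Dataflow workers rather than on the machine that submits the job.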

Answer 2

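open() is a standard Python library function and does not understand Google Cloud Storage paths. You need to use the Beam FileSystems API instead, which understands GCS as well as the other filesystems Beam supports.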
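
As a rough sketch of that suggestion, the local open() call can be swapped for FileSystems.open, which accepts gs:// paths as well as local ones. The helper name load_pickle and its open_fn parameter are assumptions made for this example, not part of Beam.

```python
import pickle


def load_pickle(path, open_fn=None):
    """Load a pickled object from any path Beam's FileSystems can open.

    open_fn is a hypothetical hook for this sketch; by default
    FileSystems.open is used, which handles gs:// and local paths.
    """
    if open_fn is None:
        # Imported lazily so the helper can be defined without Beam installed.
        from apache_beam.io.filesystems import FileSystems
        open_fn = FileSystems.open
    # The returned handle is a binary file-like object.
    with open_fn(path) as fp:
        return pickle.load(fp)
```

In the question's pipeline this would be called as load_pickle(known_args.file_path), with the same path argument that was previously passed to open().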

How to read a blob (pickle) file from GCS in a Google Cloud DataFlow job?
