Question

I want to merge the multiple files generated by a Spark job into a single file. Usually I do something like this:

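import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConfig = new Configuration()
// FileSystem.get(conf) returns the default file system (fs.defaultFS)
val hdfs = FileSystem.get(hadoopConfig)
val deleteSrcFiles = true
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), deleteSrcFiles, hadoopConfig, null)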

This works fine locally with paths such as /tmp/some/path/to.csv, but it throws an exception when run on the cluster my-cluster:

Wrong FS: gs://myBucket/path/to/result.csv, expected: hdfs://my-cluster-m

Is it possible to get a FileSystem for gs:// paths from Scala/Java code running on a Dataproc cluster?

Edit

I found the google-storage client library: https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-java

Answer 1

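A FileSystem only accepts paths that belong to it; for example, you cannot pass a gs:// path to the HDFS FileSystem as you did above.

The following snippet works for me: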

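import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

val hadoopConfig = new Configuration()
val srcPath = new Path("hdfs:/tmp/foo")
// Resolve each FileSystem from its own path so the scheme always matches
val hdfs = srcPath.getFileSystem(hadoopConfig)
val dstPath = new Path("gs://bucket/foo")
val gcs = dstPath.getFileSystem(hadoopConfig)
val deleteSrcFiles = true
FileUtil.copyMerge(hdfs, srcPath, gcs, dstPath, deleteSrcFiles, hadoopConfig, null)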
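Note that FileUtil.copyMerge was removed in Hadoop 3.0, so the call above may not compile against newer Hadoop versions. Below is a minimal sketch of a hand-written replacement that uses only the FileSystem API. The helper name mergeInto and its parameters are hypothetical and not part of Hadoop. It assumes the source directory holds only the files to merge and that name order is the desired order. Like the snippet above, it resolves each FileSystem from its own path, so the source and destination schemes may differ.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical helper (not a Hadoop API): concatenates the files directly
// under srcDir, in name order, into dstFile.
def mergeInto(srcDir: Path, dstFile: Path, deleteSrc: Boolean, conf: Configuration): Unit = {
  // Each FileSystem comes from its own path, e.g. hdfs:// and gs://
  val srcFs: FileSystem = srcDir.getFileSystem(conf)
  val dstFs: FileSystem = dstFile.getFileSystem(conf)
  val parts = srcFs.listStatus(srcDir).filter(_.isFile).sortBy(_.getPath.getName)
  val out = dstFs.create(dstFile) // overwrites an existing file by default
  try {
    parts.foreach { part =>
      val in = srcFs.open(part.getPath)
      try IOUtils.copyBytes(in, out, conf, false) // false: do not close the streams here
      finally in.close()
    }
  } finally out.close()
  // Recursively delete the source directory once everything is copied
  if (deleteSrc) srcFs.delete(srcDir, true)
}
```

With the paths from the snippet above, the call would be mergeInto(new Path("hdfs:/tmp/foo"), new Path("gs://bucket/foo"), deleteSrc = true, hadoopConfig).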

Using FileUtil.copyMerge on Google Dataproc?
