Question

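I have around 1 million text files stored in S3. I want to rename all of the files based on their folder name.

How can I do that in spark-scala?

I am looking for some sample code.

I am using Zeppelin to run my Spark script.

I tried the code below, as suggested in the answer: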

import org.apache.hadoop.fs._

val src = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN")
val dest = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN/dest")
val conf = sc.hadoopConfiguration   // assuming sc = spark context
val fs = Path.getFileSystem(conf)
fs.rename(src, dest)

But I get the error below:

<console>:110: error: value getFileSystem is not a member of object org.apache.hadoop.fs.Path
       val fs = Path.getFileSystem(conf)

Answer 1

You can use the normal HDFS APIs, something like this (typed in, not tested):

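import org.apache.hadoop.fs.Path

val src = new Path("s3a://bucket/data/src")
val dest = new Path("s3a://bucket/data/dest")
val conf = sc.hadoopConfiguration   // assuming sc = spark context
// getFileSystem is an instance method: call it on a Path value, not on the Path object
val fs = src.getFileSystem(conf)
fs.rename(src, dest)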

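The S3A client fakes a rename by doing a copy + delete of every file, so the time it takes is proportional to the number of files and the amount of data. S3 also throttles you: if you try to do this in parallel, it may slow you down. Don't be surprised if it takes "a while".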
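Since the question asks for names derived from the folder, here is a minimal, untested sketch of a per-file rename using the same Hadoop FileSystem API. The naming rule in `targetName` (prefix the file with its parent folder name) and the helper `renameByFolder` are hypothetical and only for illustration; adjust them to your layout. The listing is collected before anything is renamed so that renamed files are not picked up again.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.collection.mutable.ArrayBuffer

// Hypothetical naming rule: prefix each file name with its parent folder name.
def targetName(folder: String, file: String): String = s"${folder}_$file"

// Lists every file under `root` (recursively), then renames each one in place.
// Returns how many renames reported success.
def renameByFolder(root: Path, conf: Configuration): Int = {
  val fs = root.getFileSystem(conf)

  // Collect all paths first, so files renamed below cannot reappear in the listing.
  val listing = fs.listFiles(root, true) // true = recursive
  val paths = ArrayBuffer.empty[Path]
  while (listing.hasNext) paths += listing.next().getPath

  // On S3A each rename is a copy + delete, so this loop is slow for many files.
  paths.count { src =>
    val parent = src.getParent
    fs.rename(src, new Path(parent, targetName(parent.getName, src.getName)))
  }
}
```

In Zeppelin this could be called as `renameByFolder(new Path("s3a://bucket/data/src"), sc.hadoopConfiguration)`, ideally on a small test folder first.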

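You are also billed per COPY call, at $0.005 per 1,000 calls, so renaming 1 million files will cost you about $5 to try. Test on a small directory until you are sure everything is working.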
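The arithmetic behind that figure, as a rough sketch. It assumes the price quoted above and exactly one COPY request per file; check the current S3 pricing for your region before relying on it.

```scala
// Assumed price, taken from the answer above: 0.005 USD per 1,000 COPY requests.
val pricePerThousandCopies = BigDecimal("0.005")

// Assumes one COPY request per renamed file.
def estimatedCopyCost(fileCount: Long): BigDecimal =
  BigDecimal(fileCount) / 1000 * pricePerThousandCopies

// 1,000,000 files / 1,000 * 0.005 = 5 USD
val costForOneMillion = estimatedCopyCost(1000000L)
```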
