I am trying to copy some files to an S3 bucket using the org.apache.hadoop.tools.DistCp class. However, the overwrite functionality does not work, even though the overwrite flag is explicitly set to true.
The copy itself works fine, but existing files are not overwritten; the copy mappers skip them. I have explicitly set the overwrite option to true.
import com.typesafe.scalalogging.LazyLogging
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.tools.{DistCp, DistCpOptions}
import org.apache.hadoop.util.ToolRunner
import scala.collection.JavaConverters._
object distcptest extends App with LazyLogging {

  def copytoS3(hdfsSrcFilePathStr: String, s3DestPathStr: String) = {
    val hdfsSrcPathList = List(new Path(hdfsSrcFilePathStr))
    val s3DestPath = new Path(s3DestPathStr)

    val distcpOpt = new DistCpOptions(hdfsSrcPathList.asJava, s3DestPath)

    // Overwriting is not working despite explicitly setting it to true.
    distcpOpt.setOverwrite(true)

    val conf: Configuration = new Configuration()
    conf.set("fs.s3n.awsSecretAccessKey", "secret key")
    conf.set("fs.s3n.awsAccessKeyId", "access key")
    conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

    val distCp: DistCp = new DistCp(conf, distcpOpt)
    val filepaths: Array[String] = Array(hdfsSrcFilePathStr, s3DestPathStr)

    try {
      val distCp_result = ToolRunner.run(distCp, filepaths)
      if (distCp_result != 0) {
        logger.error(s"DistCp has failed with error code = $distCp_result")
      }
    } catch {
      case e: Exception =>
        e.printStackTrace()
    }
  }

  copytoS3("hdfs://abc/pqr", "s3n://xyz/wst")
}
Answer 0 (score: 0)
I think the problem is that you are calling ToolRunner.run(distCp, filepaths).
If you check the source code of DistCp, its run method overwrites inputOptions, so the DistCpOptions you passed to the constructor has no effect:
@Override
public int run(String[] argv) {
  ...
  try {
    inputOptions = (OptionsParser.parse(argv));
    ...
  } catch (Throwable e) {
    ...
  }
  ...
}
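Two possible ways around this, sketched below against the copytoS3 method above (a sketch assuming a Hadoop 2.x DistCp; I have not verified it against every version): either pass the -overwrite flag in the argument array so that OptionsParser.parse re-creates the option inside run, or skip ToolRunner and call execute(), which runs the job with the DistCpOptions given to the constructor.

// Option 1: pass the flag on the simulated command line, so that
// OptionsParser.parse(argv) inside run() re-creates the overwrite option.
val filepaths: Array[String] = Array("-overwrite", hdfsSrcFilePathStr, s3DestPathStr)
val distCp_result = ToolRunner.run(distCp, filepaths)

// Option 2: bypass ToolRunner entirely. execute() submits the job using the
// DistCpOptions passed to the DistCp constructor, so setOverwrite(true) is
// honored. Note it may throw on failure instead of returning an error code.
val job = distCp.execute()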