Question

I am asking because renaming the files that Spark writes to S3 is slow. I save the output from Spark as shown below.

val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumnRenamed("concatenated", headerFinal)



dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition")
      .format("csv")
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .option("nullValue", "")
      .option("delimiter", "\t")
      .option("quote", "\u0000")
      .option("header", "true")
      .save(outputFileURL)

After the save, I need to rename the files stored in S3. This is how I do it:

import java.text.SimpleDateFormat
import java.util.Calendar
import org.apache.hadoop.fs.Path

// List every part file under the partition directories (DataPartition=<value>/part-*)
val file = fs.globStatus(new Path(outputFileURL + "/*/*"))
val finalPrefix = "Fundamental.Fundamental.Fundamental."
val fileVersion = "1."
val formatDate = new SimpleDateFormat("yyyy-MM-dd-hhmm")
val now = Calendar.getInstance().getTime
val finalFormat = formatDate.format(now)
val currentTime = finalFormat + "."

val fileExtention = "Full.txt"

for (urlStatus <- file) {
  // Extract the partition value from the ".../DataPartition=<value>/part-..." path
  val DataPartitionName = urlStatus.getPath.toString.split("=")(1).split("\\/")(0)

  val finalFileName = finalPrefix + DataPartitionName + "." + fileVersion + currentTime + fileExtention
  val dest = new Path(mainFileURL + "/" + finalFileName)
  // On S3 each rename runs one after another in this loop
  fs.rename(urlStatus.getPath, dest)
}
println("File renamed and moved to dir now delete output folder")
myUtil.Utility.DeleteOuptuFolder(fs, outputFileURL)

Renaming the files takes more than 15 minutes. I have about 2k files with a total size of 200 GB. Am I doing something wrong here?

Is there a better way?

Answer 1

A rename on AWS S3 is really a copy inside the S3 store, and that copy is usually measured at about 6-10 MB/s.
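Since each rename is a copy, a loop that renames one file at a time pays for every copy in turn. One possible mitigation, which is not part of this answer, is to issue the renames concurrently. The following is a minimal sketch under that assumption: `ParallelRename`, `renameAll` and the `parallelism` parameter are hypothetical names, and the actual rename call is passed in as a function so the helper does not depend on any particular file system API.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical helper, not from the original post.
object ParallelRename {

  /** Runs `rename` for every (source, destination) pair on a fixed-size
    * thread pool and returns the pairs for which `rename` returned false.
    * An exception thrown by `rename` is rethrown to the caller.
    */
  def renameAll(pairs: Seq[(String, String)], parallelism: Int)(
      rename: (String, String) => Boolean
  ): Seq[(String, String)] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // Start one task per file; the pool caps how many run at once.
      val results = pairs.map { case (src, dst) =>
        Future((src, dst, rename(src, dst)))
      }
      // Wait for all tasks, then keep only the failed pairs.
      Await.result(Future.sequence(results), Duration.Inf)
        .collect { case (src, dst, false) => (src, dst) }
    } finally {
      pool.shutdown()
    }
  }
}
```

With the Hadoop `FileSystem` from the question, the call might look like `ParallelRename.renameAll(pairs, 16) { (s, d) => fs.rename(new Path(s), new Path(d)) }`, where `pairs` holds the source and destination path strings. The value 16 is an arbitrary placeholder, and whether concurrency helps depends on the S3 connector in use.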

Answer 2

There is no concept of renaming in S3. I would suggest writing to HDFS first and then running s3-distcp afterwards, which works better.
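The idea is that a rename on HDFS is a metadata operation, so the files can be written and renamed there and copied to S3 only once. As a rough sketch of the final copy step, assuming the job runs on an Amazon EMR node where the `s3-dist-cp` command is installed: `DistCpCommand` is a hypothetical helper, and the paths shown in the usage note are placeholders, not values from the question.

```scala
import scala.sys.process._

// Hypothetical helper, not from the original post.
object DistCpCommand {

  /** Builds the s3-dist-cp command line for copying an HDFS directory to S3. */
  def build(hdfsSrc: String, s3Dest: String): Seq[String] =
    Seq("s3-dist-cp", "--src", hdfsSrc, "--dest", s3Dest)

  /** Runs the command and returns its exit code.
    * Assumes s3-dist-cp is on the PATH, as it is on EMR nodes.
    */
  def run(hdfsSrc: String, s3Dest: String): Int =
    build(hdfsSrc, s3Dest).!
}
```

For example, after saving the DataFrame to an HDFS path and renaming the files there, `DistCpCommand.run("hdfs:///tmp/output", "s3://my-bucket/output")` would start the copy. Both paths are made up for illustration.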
