Question

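A source S3 location holds hundreds of JSON files.

All the JSON files need to be combined into a single JSON file, i.e. not part-0000... files.
The single output JSON file needs to replace all of these files at the source S3 location.
The same JSON file needs to be converted to Parquet and saved to a different S3 location.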

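Is there a better option than the following?

1. Load the JSON files into a DataFrame.
2. Save it to local disk.
3. Upload the combined JSON file to S3.
4. Upon successful upload of the combined file, clean up the remaining S3 files using the AWS SDK client API.
5. In parallel with step 4, save the Parquet file to the Parquet S3 location via the DataFrame API.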

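I have the following questions about the above design:

Is there a more robust way to do this?
Can I read from and write to the same S3 location and skip step 2?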

Answer 1

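Yes, it is possible to skip #2. You can write to the same location you read from by using SaveMode.Overwrite.

When you first read the JSON, i.e. DataFrame #1, caching it keeps the data in memory. After that you can do the cleanup, combine all the JSON into one DataFrame with union, and store it as Parquet in a single step. For example:
Case 1: all the JSON files are in different folders and you want to store the final DataFrame as Parquet in the same location as the JSON files...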

import org.apache.spark.sql.SaveMode

val dfpath1 = spark.read.json("path1")
val dfpath2 = spark.read.json("path2")
val dfpath3 = spark.read.json("path3")

// cleanup1, cleanup2 and cleanup3 stand for your own functions;
// each takes a DataFrame and returns a DataFrame
val df1 = cleanup1(dfpath1)
val df2 = cleanup2(dfpath2)
val df3 = cleanup3(dfpath3)

val dfs = Seq(df1, df2, df3)
val finaldf = dfs.reduce(_ union _) // all DataFrames must have the same schema for the union

finaldf.write.mode(SaveMode.Overwrite).parquet("final_file with samelocations json.parquet")
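The cleanup functions above are left abstract. As a minimal sketch, assuming a cleanup step is simply a DataFrame-to-DataFrame transformation, the combining step could look like the following. The object name CombineSketch and the particular cleanup (dropping all-null rows and exact duplicates) are hypothetical and only for illustration. The sketch also swaps union for unionByName (available from Spark 2.3), because union matches columns by position while unionByName matches them by name.

```scala
import org.apache.spark.sql.DataFrame

object CombineSketch {
  // Hypothetical cleanup step: drop rows in which every column is null,
  // then drop exact duplicate rows.
  def cleanup(df: DataFrame): DataFrame =
    df.na.drop("all").dropDuplicates()

  // Cleans each input, then unions the results, matching columns by name.
  // Note: reduce throws on an empty Seq, so at least one DataFrame is required.
  def combine(dfs: Seq[DataFrame]): DataFrame =
    dfs.map(cleanup).reduce(_ unionByName _)
}
```

With the names from the example above, this would be called as CombineSketch.combine(Seq(dfpath1, dfpath2, dfpath3)) before the final write.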

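Case 2: all the JSON files are in the same folder and you want to store the final DataFrame as multiple Parquet files under the same root location as the JSON files...

In this case there is no need to read multiple DataFrames; you can pass the root path that contains the JSON files sharing the same schema.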

val dfpath1 = spark.read.json("rootpathofyourjsons with same schema")

// or you can pass multiple paths: spark.read.json("path1", "path2", "path3"),
// since the DataFrame reader supports varargs: def json(paths: String*)
val finaldf = cleanup1(dfpath1) // cleanup1 stands for your own function returning a DataFrame
finaldf.write.mode(SaveMode.Overwrite).parquet("final_file with sameroot locations json.parquet")

AFAIK, the AWS S3 SDK API is not needed in either case.

UPDATE: Regarding the FileNotFoundException you are facing, see the code example below for how to handle it. I reused the same example you showed me here.

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF on a Seq outside spark-shell

val df = Seq((1, 10), (2, 20), (3, 30)).toDF("sex", "date")

df.show(false)

df.repartition(1).write.format("parquet").mode("overwrite").save(".../temp") // save it
val df1 = spark.read.format("parquet").load(".../temp") // read it back again

// stand-in for the cleanup you said you want to do
val df2 = df1.withColumn("cleanup", lit("Quick silver want to cleanup"))

// THE NEXT 2 STEPS ARE IMPORTANT: cache, then force a light action such as show,
// without which a FileNotFoundException will be thrown.
df2.cache // cache to avoid FileNotFoundException
df2.show(2, false) // light action to avoid FileNotFoundException
// or println(df2.count) // any action will do

df2.repartition(1).write.format("parquet").mode("overwrite").save(".../temp")
println("quick silver saved in same directory where he read it from final records he saved after clean up are  ")
df2.show(false)

Result:

+---+----+
|sex|date|
+---+----+
|1  |10  |
|2  |20  |
|3  |30  |
+---+----+

+---+----+----------------------------+
|sex|date|cleanup                     |
+---+----+----------------------------+
|1  |10  |Quick silver want to cleanup|
|2  |20  |Quick silver want to cleanup|
+---+----+----------------------------+
only showing top 2 rows

quick silver saved in same directory where he read it from final records he saved after clean up are  
+---+----+----------------------------+
|sex|date|cleanup                     |
+---+----+----------------------------+
|1  |10  |Quick silver want to cleanup|
|2  |20  |Quick silver want to cleanup|
|3  |30  |Quick silver want to cleanup|
+---+----+----------------------------+

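Screenshot of the files being saved, read back, cleaned up, and saved again:

Note: you need to implement Case 1 or Case 2 as suggested in the update above...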

Answer 2

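import java.io.File
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// sourcePath, tempTarget1, bucketName, sparkUuid, sc and logger are defined elsewhere.

// Merge all source JSON files into a single part file under a temporary location.
spark.read
  .json(sourcePath)
  .coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .json(tempTarget1)

val fs = FileSystem.get(new URI(s"s3a://$bucketName"), sc.hadoopConfiguration)

// Remove the original source folder recursively.
val deleted = fs.delete(new Path(sourcePath + File.separator), true)
logger.info(s"S3 folder path deleted=${deleted} sparkUuid=$sparkUuid path=${sourcePath}")

// Move the temporary output into the place of the source folder.
val renamed = fs.rename(new Path(tempTarget1), new Path(sourcePath))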
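The snippet above deletes the source folder before it knows whether the rename will succeed, and both delete and rename report failure through their Boolean return values. A minimal sketch of a more defensive variant follows; the object DirectorySwap and the method swapDirectory are hypothetical names, not part of Spark or Hadoop. Keep in mind that on S3 a rename through s3a is implemented as copy plus delete, so the swap is not atomic either way.

```scala
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

object DirectorySwap {
  // Replaces the directory `target` with the directory `temp`.
  // Both paths are assumed to live on the same FileSystem.
  def swapDirectory(fs: FileSystem, temp: Path, target: Path): Unit = {
    // Refuse to delete anything unless the replacement data really exists.
    if (!fs.exists(temp))
      throw new IOException(s"Temp path $temp does not exist; nothing was deleted")

    // delete returns false when nothing was removed (for example the target
    // was already gone), which is acceptable here, so the result is ignored.
    fs.delete(target, true)

    // rename reports failure through its return value on many implementations.
    if (!fs.rename(temp, target))
      throw new IOException(s"Could not rename $temp to $target")
  }
}
```

With the names from the answer, the call would be DirectorySwap.swapDirectory(fs, new Path(tempTarget1), new Path(sourcePath)).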

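Failed attempts:

DataFrame caching/persisting did not work, because whenever I tried to write with cachedDf.write, Spark went back to check the S3 files that I had manually cleaned up before the write.
Writing the DataFrame directly to the same S3 directory did not work, because the DataFrame only overwrites the partition files, i.e. the files whose names start with 'part-00...'.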
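Since the DataFrame writer always emits part-... files inside the output directory, one common way to end up with a single JSON file under a chosen name is to write with coalesce(1) to a temporary directory and then move the lone part file with the Hadoop FileSystem API. The following is a sketch of that second step only; SingleFile and promoteSinglePart are hypothetical names introduced for illustration.

```scala
import java.io.IOException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object SingleFile {
  // Moves the only part-* file found in `dir` to `target`.
  // Fails if `dir` holds zero or several part files.
  def promoteSinglePart(fs: FileSystem, dir: Path, target: Path): Unit = {
    // globStatus may return null for a missing path, so guard with Option.
    val parts = Option(fs.globStatus(new Path(dir, "part-*")))
      .getOrElse(Array.empty[FileStatus])
      .map(_.getPath)

    if (parts.length != 1)
      throw new IOException(s"Expected exactly one part file in $dir, found ${parts.length}")

    if (!fs.rename(parts.head, target))
      throw new IOException(s"Could not rename ${parts.head} to $target")
  }
}
```

It would be called after something like df.coalesce(1).write.mode(SaveMode.Overwrite).json(tempDir), with dir pointing at tempDir and target at the final file name. The parent directory of target is assumed to exist already.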

Merge multiple JSON files into a single JSON file and a Parquet file

2 answers:
