Using s3a

Time: 2018-07-12 06:32:10

Tags: apache-spark amazon-s3 apache-spark-sql parquet apache-spark-2.0

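I have a Parquet file stored at s3a://bucket/file.parquet. I want to load it into a Spark (v2.2.0) DataFrame, run some analysis, and then overwrite the Parquet file with the updated DataFrame. The problem is that when I try this, I lose all of my data. Specifically, I do the following: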

import org.apache.spark.sql.SaveMode  // needed for SaveMode.Overwrite below

println("loading file")
val foo = spark.read.load("s3a://bucket/file.parquet")
foo.count

println("overwriting file")
foo.write.mode(SaveMode.Overwrite).format("parquet").save("s3a://bucket/file.parquet")

println("reloading file")
val bar = spark.read.load("s3a://bucket/file.parquet")
bar.count

This outputs:

loading file
foo: org.apache.spark.sql.DataFrame = [date: string, value: string ... 13 more fields]
res1: Long = 218528
overwriting file
reloading file
bar: org.apache.spark.sql.DataFrame = [date: string, value: string ... 13 more fields]
res2: Long = 0

As you can see, overwriting and then reloading the file yields a DataFrame with the correct schema but no data.

If I delete the Parquet file and then write a new one instead of overwriting the existing one, I get the same behavior:

import java.net.URI
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

println("loading file")
val foo = spark.read.load("s3a://bucket/file.parquet")
foo.count

println("overwriting file")
FileSystem.get(new URI("s3a://bucket/"),sc.hadoopConfiguration).delete(new Path("s3a://bucket/file.parquet"), true)
foo.write.format("parquet").save("s3a://bucket/file.parquet")

println("reloading file")
val bar = spark.read.load("s3a://bucket/file.parquet")
bar.count

This outputs:

import java.net.URI
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
loading file
foo: org.apache.spark.sql.DataFrame = [date: string, value: string ... 13 more fields]
res1: Long = 218528
overwriting file
res2: Boolean = true
reloading file
bar: org.apache.spark.sql.DataFrame = [date: string, value: string ... 13 more fields]
res3: Long = 0
  
    

Question: What gives, and how can I overwrite the Parquet file without losing my data?
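
One thing worth checking here is lazy evaluation. `foo.count` does not cache or materialize the rows, so `foo` still reads from s3a://bucket/file.parquet when the write job runs. In the second snippet the source files are deleted before that job starts, so there is nothing left to read. The first snippet may be hitting the same effect, although the post does not confirm it. Under that assumption, a minimal sketch of a workaround is to write to a temporary path first and swap it in afterwards. `overwriteViaTemp` and `tmpPath` are hypothetical names, not part of the original post, and `rename` on S3A is a copy followed by a delete rather than an atomic operation.

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical helper: rewrite the Parquet data at `target` without reading
// from and writing to the same path within a single job.
def overwriteViaTemp(
    spark: SparkSession,
    target: String,
    tmpPath: String,
    transform: DataFrame => DataFrame): Unit = {
  // 1. Read the current data and write the updated result to a temporary path.
  //    The source files are still intact while this job runs.
  val updated = transform(spark.read.parquet(target))
  updated.write.mode(SaveMode.Overwrite).parquet(tmpPath)

  // 2. Only after the write has finished, replace the original with the copy.
  val fs = FileSystem.get(new URI(target), spark.sparkContext.hadoopConfiguration)
  fs.delete(new Path(target), true)
  if (!fs.rename(new Path(tmpPath), new Path(target))) {
    throw new java.io.IOException(s"Could not move $tmpPath to $target")
  }
}
```

In spark-shell this could be called as `overwriteViaTemp(spark, "s3a://bucket/file.parquet", "s3a://bucket/file.parquet.tmp", df => df)`, with the identity function replaced by the actual analysis step. The temporary path is an assumption and should point somewhere that no other job reads from.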

         

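Edit: I think this may be related to S3's "eventual consistency", i.e. I am trying to read the file too soon after writing it. Assuming that is correct, is there a way to determine whether the file is up to date before reading it?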
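
As a partial check, and only assuming the job uses Spark's default output committer (which writes an empty `_SUCCESS` marker into the output directory when the job commits), the sketch below looks for that marker and reports its modification time. `lastSuccessfulWrite` is a hypothetical helper, not an existing Spark or Hadoop function, and it does not by itself guarantee that a later S3 listing is consistent.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: returns the modification time (epoch millis) of the
// _SUCCESS marker under `dir`, or None if no marker is visible yet.
def lastSuccessfulWrite(dir: String, conf: Configuration): Option[Long] = {
  val fs = FileSystem.get(new URI(dir), conf)
  val marker = new Path(dir, "_SUCCESS")
  if (fs.exists(marker)) Some(fs.getFileStatus(marker).getModificationTime)
  else None
}
```

In spark-shell, `lastSuccessfulWrite("s3a://bucket/file.parquet", sc.hadoopConfiguration)` could be compared against a timestamp taken before the write. Clock differences between the client and the object store make that comparison approximate.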

  

For what it's worth, I tried switching to csv, but when I try to load the csv I get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 298.0 failed 4 times, most recent failure: Lost task 0.3 in stage 298.0 (TID 9693, 10.23.19.244, executor 2): java.io.IOException: Could not read footer for file: FileStatus{path=s3://bucket/file.csv/part-00000-0f7e77f1-55fe-44b7-bb84-89d725441632-c000.csv; isDirectory=false; length=8356173; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
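
The "Could not read footer" message comes from the Parquet reader. The post does not show the call that was used, but `spark.read.load` without an explicit format falls back to the default data source (`spark.sql.sources.default`, which is `parquet` unless changed), so this exception is what one would expect if the CSV was loaded that way. A minimal sketch of reading it with an explicit format follows; the `readCsv` name and the `header` option are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: read CSV output with an explicit format instead of
// relying on the default data source.
def readCsv(spark: SparkSession, path: String): DataFrame =
  spark.read
    .format("csv")
    .option("header", "true") // assumption: the files were written with a header row
    .load(path)
```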

0 answers:

No answers