Question

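I am trying to convert a file (csv.gz format) to Parquet using a streaming DataFrame. I have to use a streaming DataFrame because the compressed files are about 700 MB in size. The job runs on AWS EMR using a custom jar. The source, destination, and checkpoint locations are all on AWS S3. But as soon as I try to write to the checkpoint, the job fails with the following error: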

java.lang.IllegalArgumentException: 
Wrong FS: s3://my-bucket-name/transformData/checkpoints/sourceName/fileType/metadata,
expected: hdfs://ip-<ip_address>.us-west-2.compute.internal:8020

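Other Spark jobs running on the same EMR cluster read from and write to S3 successfully (but they do not use Spark streaming). So I do not think this is an S3 filesystem access problem as suggested in this post. I also looked at this question, but the answers did not help in my case. I am using Scala 2.11.8 and Spark 2.1.0. Here is the code I have so far: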

...

    val spark = conf match {
      case null =>
        SparkSession
          .builder()
          .appName(this.getClass.toString)
          .getOrCreate()
      case _ =>
        SparkSession
          .builder()
          .config(conf)
          .getOrCreate()
    }

    // Read CSV file into structured streaming dataframe
    val streamingDF = spark.readStream
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter","|")
      .option("timestampFormat", "dd-MMM-yyyy HH:mm:ss")
      .option("treatEmptyValuesAsNulls", "true")
      .option("nullValue","")
      .schema(schema)
      .load(s"s3://my-bucket-name/rawData/sourceName/fileType/*/*/fileNamePrefix*")
      .withColumn("event_date", $"event_datetime".cast("date"))
      .withColumn("event_year", year($"event_date"))
      .withColumn("event_month", month($"event_date"))

    // Write the results to Parquet
    streamingDF.writeStream
      .format("parquet")
      .option("path", "s3://my-bucket-name/transformedData/sourceName/fileType/")
      .option("compression", "gzip")
      .option("checkpointLocation", "s3://my-bucket-name/transformedData/checkpoints/sourceName/fileType/")
      .partitionBy("event_year", "event_month")
      .trigger(ProcessingTime("900 seconds"))
      .start()

I also tried using s3n:// instead of s3:// in the URIs, but that did not seem to have any effect.

Answer 1

TL;DR: upgrade Spark, or avoid using S3 as the checkpoint location.

Apache Spark (Structured Streaming) : S3 Checkpoint support
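
The "Wrong FS" message shows the checkpoint path being checked against the cluster's default filesystem (hdfs://...), so one way to avoid S3 for checkpoints is to point checkpointLocation at HDFS and keep the Parquet output on S3. A minimal sketch of that idea, assuming Spark 2.1 as in the question; the object name, the start helper, and the HDFS directory are hypothetical, and I have not verified this on an EMR cluster:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}

object HdfsCheckpointSketch {
  // Hypothetical directory on the cluster's default filesystem (HDFS).
  val checkpointDir = "hdfs:///checkpoints/sourceName/fileType/"

  // Same Parquet sink as in the question; only the checkpoint location changes.
  def start(streamingDF: DataFrame): StreamingQuery =
    streamingDF.writeStream
      .format("parquet")
      .option("path", "s3://my-bucket-name/transformedData/sourceName/fileType/")
      .option("compression", "gzip")
      .option("checkpointLocation", checkpointDir) // HDFS instead of S3
      .partitionBy("event_year", "event_month")
      .trigger(ProcessingTime("900 seconds"))
      .start()
}
```

Keep in mind that HDFS on an EMR cluster goes away when the cluster is terminated, so a checkpoint stored there will not survive a cluster restart.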

Also, you should specify the write path with s3a://.

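A successor to the S3 Native (s3n://) filesystem, the S3a system uses Amazon's libraries to interact with S3. This allows S3a to support larger files (no more 5 GB limit), higher-performance operations, and more. The filesystem is intended to be a replacement for, and successor to, S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a simply by replacing the URL scheme.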

https://wiki.apache.org/hadoop/AmazonS3
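
Since the quote says the switch is only a change of URL scheme, it can be sketched as a small string rewrite. The object and function names are hypothetical, and treating the question's s3:// paths the same way as s3n:// is my assumption, not something the wiki text states:

```scala
object S3aSchemeSketch {
  // Rewrites a leading s3:// or s3n:// to s3a://; bucket and key are left untouched.
  // URLs with any other scheme (including s3a://) are returned unchanged.
  def toS3a(url: String): String =
    url.replaceFirst("^s3n?://", "s3a://")
}
```

For example, S3aSchemeSketch.toS3a("s3n://my-bucket-name/rawData/") gives "s3a://my-bucket-name/rawData/".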
