Question

I have a Spark job running on EMR that writes JSON files to S3 through EMRFS:

dataframe
  .coalesce(1)
  .write()
  .option("compression", "gzip")
  .mode(SaveMode.Overwrite)
  .json(outputPath);

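The problem is that the output file carries only one header, Content-Type = application/octet-stream. The other header, Content-Encoding = gzip, is missing.

How can I set the metadata Content-Encoding = gzip when writing the output files with Spark?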

Answer 1

You can also pass the settings as a Map through options(...):

val metadataoptions = Map("compression" -> "gzip", "Content-Language" -> "US-En")

dataframe.coalesce(1).write.mode(SaveMode.Overwrite).options(metadataoptions).json(outputPath)

You need this import:
import scala.collection.Map
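
Since the snippet in the question is Java rather than Scala, here is a minimal sketch of the same options map built with java.util.Map. The class name WriterOptions and the method name metadataOptions are hypothetical, chosen only for illustration. The keys and values are copied from the Scala Map above. Whether EMRFS turns them into S3 object metadata is an assumption carried over from this answer, not something the sketch verifies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: the class and method names are illustrative only.
final class WriterOptions {
    private WriterOptions() {}

    // Builds the same key/value pairs as the Scala Map in the answer.
    static Map<String, String> metadataOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("compression", "gzip");
        options.put("Content-Language", "US-En");
        return options;
    }
}
```

With Spark on the classpath, the map can be handed to the DataFrameWriter.options overload that accepts a java.util.Map<String, String>:

dataframe.coalesce(1).write().mode(SaveMode.Overwrite).options(WriterOptions.metadataOptions()).json(outputPath);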