Question

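I have a set of prefixed (per the S3 performance recommendations) Parquet files that I want to load into Spark (using Amazon EMR 5.11.1), but:

Listing the set of files that match the glob is much slower than for non-prefixed files. Can this be improved?
How can I avoid the following error?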

 val df = spark.read.parquet("s3://bucket/????/analytics")

java.lang.AssertionError: assertion failed: Conflicting directory
     structures detected. Suspicious paths:
        s3://bucket/4a73/analytics
        s3://bucket/8163/analytics

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
  at scala.Predef$.assert(Predef.scala:170)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:132)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:97)
  at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:153)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:70)
  at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:134)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:543)
  ... 48 elided

Answer 1

You can use s3a instead of s3. This may work for you.

1. You also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.

2. In spark.properties, you can set the following:

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem  
spark.hadoop.fs.s3a.access.key=ACCESSKEY  
spark.hadoop.fs.s3a.secret.key=SECRETKEY
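
Separately from the filesystem settings, the error message in the question suggests loading each root directory on its own and then unioning the results. The following is only a minimal sketch of that suggestion, not a confirmed fix. The object name LoadPrefixedParquet and the helpers prefixPaths and loadAndUnion are hypothetical, and the sketch assumes the prefixes are known up front and that every root has the same schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, for illustration only.
object LoadPrefixedParquet {

  // Build one root path per prefix, e.g. s3://bucket/4a73/analytics.
  def prefixPaths(bucket: String, prefixes: Seq[String], suffix: String): Seq[String] =
    prefixes.map(p => s"s3://$bucket/$p/$suffix")

  // Read each root separately, then union the resulting DataFrames.
  // Dataset.union matches columns by position, so this assumes
  // that all roots share the same schema.
  def loadAndUnion(spark: SparkSession, paths: Seq[String]): DataFrame = {
    require(paths.nonEmpty, "at least one path is required")
    paths.map(p => spark.read.parquet(p)).reduce(_ union _)
  }
}
```

With the two prefixes from the error message, usage might look like this (spark is the existing SparkSession):

```scala
val paths = LoadPrefixedParquet.prefixPaths("bucket", Seq("4a73", "8163"), "analytics")
val df = LoadPrefixedParquet.loadAndUnion(spark, paths)
```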
