我有一组前缀(根据S3性能建议)镶木地板文件我想加载spark(使用Amazon EMR 5.11.1)但
val df = spark.read.parquet("s3://bucket/????/analytics")
java.lang.AssertionError: assertion failed: Conflicting directory
structures detected. Suspicious paths:?
s3://bucket/4a73/analytics
s3://bucket/8163/analytics
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:132)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:97)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:153)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:70)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:134)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:543)
... 48 elided
答案 0 :(得分:0)
您可以使用s3a代替s3。 这可能适合您。
1.你还需要在类路径上使用hadoop-aws 2.7.1 JAR。 这个JAR包含
class org.apache.hadoop.fs.s3a.S3AFileSystem.
2.在spark.properties中,您可以进行如下设置:
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY