"Path not found" when trying to read files from s3 buckets in different regions

Date: 2019-04-08 14:19:31

Tags: apache-spark hadoop amazon-s3 amazon-emr

I am trying to read all files from two different s3 buckets located in different regions.

When reading from the first bucket, which is in us-east-1 (the same region the EMR instance is running in), the code is happy and my main for-comprehension does not die.

When reading from the second bucket, located in us-west-2, the code fails with the following message: "Merge failed: Path does not exist: hdfs://ip-10-240-15-43.bamtech.test.us-east-1.bamgrid.net:8020/user/hadoop/some-bucket-us-west-2;"

I have tried setting the endpoints for both buckets in the spark conf, following the instructions at https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/s3-per-bucket-region-configs.html:
val sparkConf = new SparkConf()
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          //.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
          .set(
            "fs.s3.bucket.some-bucket-us-east-1.endpoint",
            "s3.amazonaws.com"
          )
          .set(
            "fs.s3.bucket.some-bucket-us-west-2.endpoint",
            "s3-us-west-2.amazonaws.com"
          )
          // Todo: Add european buckets.
          .registerKryoClasses(Array(classOf[Event]))

I tried changing the s3 protocol to s3a (and changing the property keys above from "fs.s3" to "fs.s3a"), but then the application appears to freeze before it even reads from the first bucket.
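For reference, the s3a variant of the configuration would look roughly like the sketch below. This is an assumption based on the Hadoop s3a connector's per-bucket configuration scheme (`fs.s3a.bucket.<name>.<option>`), not code from the question; the bucket names are the same placeholders used above.

```scala
import org.apache.spark.SparkConf

// Sketch: per-bucket endpoint overrides using the s3a connector's
// per-bucket configuration keys (available in Hadoop 2.8+).
// Bucket names are placeholders from the question.
val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("fs.s3a.bucket.some-bucket-us-east-1.endpoint", "s3.amazonaws.com")
  .set("fs.s3a.bucket.some-bucket-us-west-2.endpoint", "s3.us-west-2.amazonaws.com")
```

Note that paths would then need the `s3a://` scheme rather than `s3://` for these keys to take effect.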

The bucket names have been changed because they are sensitive, but the actual buckets do exist. I suspect this is cross-region related, but from my research it seems this was fixed as of emr 5.1.0 (I am using emr 5.21.0).

Here is the main for-comprehension:

for {
      bucketOnePath               <- buildBucketPath(config.bucketOne.name)
      _                           <- log(s"Reading events from $bucketOnePath")
      bucketOneEvents: RDD[Event] <- parseEvents(bucketOnePath, config.service)
      _                           <- log(s"Enriching events from $bucketOnePath with originating region data")
      bucketOneEventsWithRegion: RDD[Event] <- enrichEventsWithRegion(
        bucketOneEvents,
        config.bucketOne.region
      )

      bucketTwoPath               <- buildBucketPath(config.bucketTwo.name)
      _                           <- log(s"Reading events from $bucketTwoPath")
      bucketTwoEvents: RDD[Event] <- parseEvents(config.bucketTwo.name, config.service)
      _                           <- log(s"Enriching events from $bucketTwoPath with originating region data")
      bucketTwoEventsWithRegion: RDD[Event] <- enrichEventsWithRegion(
        bucketTwoEvents,
        config.bucketTwo.region
      )

      _                        <- log("Merging events")
      mergedEvents: RDD[Event] <- merge(bucketOneEventsWithRegion, bucketTwoEventsWithRegion)
      if !mergedEvents.isEmpty()
      _ <- log("Grouping merged events by partition key")
      mergedEventsByPartitionKey: RDD[(EventsPartitionKey, Iterable[Event])] <- eventsByPartitionKey(
        mergedEvents
      )

      _ <- log(s"Storing merged events to ${config.outputBucket.name}")
      _ <- store(config.outputBucket.name, config.service, mergedEventsByPartitionKey)
    } yield ()
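For context, `buildBucketPath` is not shown in the question; a hypothetical version consistent with the glob paths that appear in the stdout logs could look like this (the `Either` effect type is an assumption, chosen only because the for-comprehension binds and logs monadically):

```scala
// Hypothetical sketch, not the asker's actual code: builds the
// "s3://<bucket>/*/*/*/*/*.gz" glob seen in the stdout output,
// wrapped in Either so it composes in the for-comprehension.
def buildBucketPath(bucketName: String): Either[Throwable, String] =
  if (bucketName.nonEmpty)
    Right(s"s3://$bucketName/*/*/*/*/*.gz")
  else
    Left(new IllegalArgumentException("empty bucket name"))
```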

Here is the stdout output from the emr logs:

Reading events from s3://some-bucket-us-east-1/*/*/*/*/*.gz
Enriching events from s3://some-bucket-us-east-1/*/*/*/*/*.gz with originating region data
Reading events from s3://some-bucket-us-west-2/*/*/*/*/*.gz
Merge failed: Path does not exist: hdfs://ip-10-240-15-43.bamtech.test.us-east-1.bamgrid.net:8020/user/hadoop/some-bucket-us-west-2;

0 Answers:

There are no answers yet.