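I am trying to read all files from two different S3 buckets that are located in different regions.
Reading the first bucket works: it is in us-east-1, the same region as the EMR instance, and my main for-comprehension does not fail at that step.
Reading the second bucket, which is in us-west-2, fails with the following message: "Merge failed: Path does not exist: hdfs://ip-10-240-15-43.bamtech.test.us-east-1.bamgrid.net:8020/user/hadoop/some-bucket-us-west-2;"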
I followed the instructions for setting the endpoints of both buckets in the Spark conf:
val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  //.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set(
    "fs.s3.bucket.some-bucket-us-east-1.endpoint",
    "s3.amazonaws.com"
  )
  .set(
    "fs.s3.bucket.some-bucket-us-west-2.endpoint",
    "s3-us-west-2.amazonaws.com"
  )
  // Todo: Add european buckets.
  .registerKryoClasses(Array(classOf[Event]))
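I also tried changing the protocol from s3 to s3a (and the property keys above from "fs.s3" to "fs.s3a"), but then the application seems to freeze before it reads from the first bucket.
The bucket names have been changed because they are sensitive, but the real buckets do exist. I suspect this has to do with cross-region access, but from my research that appears to have been fixed as of EMR 5.1.0 (I am using EMR 5.21.0).
Here is the main for-comprehension: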
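For reference, a minimal sketch of how per-bucket endpoints are usually keyed for the S3A connector. Hadoop's S3A connector reads per-bucket options from keys of the form fs.s3a.bucket.<bucket>.<option>, and Spark copies SparkConf entries that start with spark.hadoop. into the Hadoop configuration. The object and helper names below are hypothetical, and the sketch only concerns s3a:// URLs; it makes no claim about how EMR's s3:// handler treats such keys.

```scala
// Hypothetical helper, not part of Spark or Hadoop. It only builds key/value
// pairs; nothing here talks to S3.
object S3aEndpoints {
  // SparkConf key under which a per-bucket S3A endpoint is forwarded to the
  // Hadoop configuration (the "spark.hadoop." prefix is stripped by Spark).
  def endpointSetting(bucket: String, endpoint: String): (String, String) =
    s"spark.hadoop.fs.s3a.bucket.$bucket.endpoint" -> endpoint

  // Bucket names and endpoints are the placeholders used in this post.
  val settings: Seq[(String, String)] = Seq(
    endpointSetting("some-bucket-us-east-1", "s3.amazonaws.com"),
    endpointSetting("some-bucket-us-west-2", "s3-us-west-2.amazonaws.com")
  )

  def main(args: Array[String]): Unit =
    settings.foreach { case (key, value) => println(s"$key=$value") }
}
```

With Spark on the classpath, these pairs could be applied with new SparkConf().setAll(S3aEndpoints.settings).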
for {
  bucketOnePath <- buildBucketPath(config.bucketOne.name)
  _ <- log(s"Reading events from $bucketOnePath")
  bucketOneEvents: RDD[Event] <- parseEvents(bucketOnePath, config.service)
  _ <- log(s"Enriching events from $bucketOnePath with originating region data")
  bucketOneEventsWithRegion: RDD[Event] <- enrichEventsWithRegion(
    bucketOneEvents,
    config.bucketOne.region
  )
  bucketTwoPath <- buildBucketPath(config.bucketTwo.name)
  _ <- log(s"Reading events from $bucketTwoPath")
  bucketTwoEvents: RDD[Event] <- parseEvents(config.bucketTwo.name, config.service)
  _ <- log(s"Enriching events from $bucketTwoPath with originating region data")
  bucketTwoEventsWithRegion: RDD[Event] <- enrichEventsWithRegion(
    bucketTwoEvents,
    config.bucketTwo.region
  )
  _ <- log("Merging events")
  mergedEvents: RDD[Event] <- merge(bucketOneEventsWithRegion, bucketTwoEventsWithRegion)
  if !mergedEvents.isEmpty()
  _ <- log("Grouping merged events by partition key")
  mergedEventsByPartitionKey: RDD[(EventsPartitionKey, Iterable[Event])] <- eventsByPartitionKey(
    mergedEvents
  )
  _ <- log(s"Storing merged events to ${config.outputBucket.name}")
  _ <- store(config.outputBucket.name, config.service, mergedEventsByPartitionKey)
} yield ()
Here is the stdout output from the EMR logs:
Reading events from s3://some-bucket-us-east-1/*/*/*/*/*.gz
Enriching events from s3://some-bucket-us-east-1/*/*/*/*/*.gz with originating region data
Reading events from s3://some-bucket-us-west-2/*/*/*/*/*.gz
Merge failed: Path does not exist: hdfs://ip-10-240-15-43.bamtech.test.us-east-1.bamgrid.net:8020/user/hadoop/some-bucket-us-west-2;
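One detail in this output may be worth checking: the failing path has no s3:// scheme and ends in the bare bucket name. That is the shape a scheme-less string takes when Hadoop resolves it against the cluster's default filesystem and the user's working directory. In the for-comprehension, the second parseEvents call receives config.bucketTwo.name, while the first receives bucketOnePath. The sketch below illustrates that kind of resolution using java.net.URI as a stand-in for Hadoop's path handling. This is an assumption for illustration only: Hadoop's own Path class is not used, the object name is hypothetical, and the namenode host is a placeholder.

```scala
import java.net.URI

object SchemeLessPathDemo {
  // Stand-in for the default filesystem plus the user's working directory.
  // "namenode" is a placeholder host, not the real name node.
  val workingDir: URI = new URI("hdfs://namenode:8020/user/hadoop/")

  // A string without a scheme is treated as relative and lands under
  // workingDir; a string with a scheme is already absolute and is kept as-is.
  def resolve(path: String): URI = workingDir.resolve(path)

  def main(args: Array[String]): Unit = {
    // prints hdfs://namenode:8020/user/hadoop/some-bucket-us-west-2
    println(resolve("some-bucket-us-west-2"))
    // prints s3://some-bucket-us-west-2/
    println(resolve("s3://some-bucket-us-west-2/"))
  }
}
```

If this is what happens here, the endpoint settings would never come into play for the second read, because the request never reaches an S3 filesystem at all.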