Reading from multiple S3 buckets in the same region using Spark

Asked: 2019-04-09 13:54:24

Tags: apache-spark amazon-s3 amazon-emr

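I am trying to read files from multiple S3 buckets.

Originally, the buckets were going to be in different regions, but that appears not to be possible.

So I have now copied the other bucket into the same region as the first bucket I want to read from, which is also the region I run the Spark job in.

SparkSession setup: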

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Event]))

SparkSession.builder
  .appName("Merge application")
  .config(sparkConf)
  .getOrCreate()

The function, which is called with the SQLContext from the SparkSession created above:

private def parseEvents(bucketPath: String, service: String)(
    implicit sqlContext: SQLContext
): Try[RDD[Event]] =
  Try(
    sqlContext.read
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .json(bucketPath)
      .toJSON
      .rdd
      .map(buildEvent(_, bucketPath, service).get)
  )

Main flow:

for {
  bucketOnePath               <- buildBucketPath(config.bucketOne.name)
  _                           <- log(s"Reading events from $bucketOnePath")
  bucketOneEvents: RDD[Event] <- parseEvents(bucketOnePath, config.service)
  _                           <- log(s"Enriching events from $bucketOnePath with originating region data")
  bucketOneEventsWithRegion: RDD[Event] <- enrichEventsWithRegion(
    bucketOneEvents,
    config.bucketOne.region
  )

  bucketTwoPath               <- buildBucketPath(config.bucketTwo.name)
  _                           <- log(s"Reading events from $bucketTwoPath")
  bucketTwoEvents: RDD[Event] <- parseEvents(config.bucketTwo.name, config.service)
  _                           <- log(s"Enriching events from $bucketTwoPath with originating region data")
  bucketTwoEventsWithRegion: RDD[Event] <- enrichEventsWithRegion(
    bucketTwoEvents,
    config.bucketTwo.region
  )

  _                        <- log("Merging events")
  mergedEvents: RDD[Event] <- merge(bucketOneEventsWithRegion, bucketTwoEventsWithRegion)
  if mergedEvents.isEmpty() == false
  _ <- log("Grouping merged events by partition key")
  mergedEventsByPartitionKey: RDD[(EventsPartitionKey, Iterable[Event])] <- eventsByPartitionKey(
    mergedEvents
  )

  _ <- log(s"Storing merged events to ${config.outputBucket.name}")
  _ <- store(config.outputBucket.name, config.service, mergedEventsByPartitionKey)
} yield ()

The error I get in the logs (the actual bucket names have been changed, but the real ones do exist):

19/04/09 13:10:20 INFO SparkContext: Created broadcast 4 from rdd at MergeApp.scala:141
19/04/09 13:10:21 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
org.apache.spark.sql.AnalysisException: Path does not exist: hdfs:someBucket2

My stdout log shows how far the main code got before it failed:

Reading events from s3://someBucket/*/*/*/*/*.gz
Enriching events from s3://someBucket/*/*/*/*/*.gz with originating region data
Reading events from s3://someBucket2/*/*/*/*/*.gz
Merge failed: Path does not exist: hdfs://someBucket2

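Strangely, the first read always works, no matter which bucket I pick for it, while the second read always fails, regardless of the bucket's size. This tells me there is nothing wrong with the buckets themselves, but that something goes wrong in Spark when working with multiple S3 buckets.

I can only find threads about reading multiple files from a single S3 bucket, not about reading multiple files from multiple S3 buckets.

Any ideas?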

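1 Answer:

Answer 0: (score: 0)

You are missing the s3:// prefix on the someBucket2 path, so Spark tries (by default) to find it in HDFS. In your main flow, the second read calls parseEvents(config.bucketTwo.name, config.service) with the bare bucket name, whereas the first read passes the built path bucketOnePath. A path without a scheme is resolved against the cluster's default filesystem, which is why the error mentions hdfs. Passing bucketTwoPath to parseEvents should fix it, and it also explains why the first read works whichever bucket you use there.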
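As a minimal sketch of how to catch this kind of mistake early, the snippet below builds the full path from a bucket name and rejects any path that lacks an S3 scheme before it reaches Spark. The object name BucketPaths, the helper requireS3Path, and this body of buildBucketPath are hypothetical; the question does not show its own buildBucketPath, so the glob pattern is simply copied from the stdout log, and the list of accepted schemes is an assumption.

```scala
import scala.util.{Failure, Success, Try}

object BucketPaths {
  // Schemes accepted as S3 URIs here. This list is an assumption;
  // adjust it to whatever your cluster actually uses.
  private val s3Schemes = Seq("s3://", "s3a://", "s3n://")

  // Hypothetical stand-in for the question's buildBucketPath:
  // turns a bare bucket name into the glob seen in the stdout log.
  def buildBucketPath(bucketName: String): Try[String] =
    if (bucketName.trim.isEmpty)
      Failure(new IllegalArgumentException("Bucket name must not be empty"))
    else
      Success(s"s3://$bucketName/*/*/*/*/*.gz")

  // Hypothetical guard: fails fast when a path has no S3 scheme,
  // instead of letting it fall through to the default filesystem.
  def requireS3Path(path: String): Try[String] =
    if (s3Schemes.exists(scheme => path.startsWith(scheme))) Success(path)
    else Failure(new IllegalArgumentException(s"Expected an S3 URI but got '$path'"))
}
```

In the for-comprehension, the second read would then become bucketTwoEvents <- parseEvents(bucketTwoPath, config.service), optionally preceded by _ <- BucketPaths.requireS3Path(bucketTwoPath) so that a bare bucket name fails with a clear message rather than "Path does not exist".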