Create a Spark RDD from S3 files

Date: 2019-07-18 01:33:24

Tags: apache-spark amazon-s3 pyspark

I am new to Spark and AWS S3. My goal is to read multiple gzip files from AWS S3 into an RDD, run some transformations/actions on each file, and write the results to output files on S3. I'm not there yet: for now I have a single EC2 instance running Ubuntu 16.04 with Python, Spark, and boto3 installed, and I'm struggling to get even a simple line count to work. Here is the code. Can someone point out what I'm doing wrong?

import os.path
from pathlib import Path
from pyspark import SparkContext, SparkConf
from boto3.session import Session

# Raj Tambe credentials
ACCESS_KEY = 'Blah Blah'
SECRET_KEY = 'Blah Blah'
BUCKET_NAME = 'bucketname'
PREFIX = 'foldername/'
MAX_FILES_READ = 3

if __name__ == "__main__":
        # Use Boto to connect to S3 and get a list of objects from a bucket
        session = Session(aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)

        s3 = session.resource('s3')

        # Get a handle to the target bucket
        my_bucket = s3.Bucket(BUCKET_NAME)

        # Get a Spark context and use it to parallelize the keys
        conf = SparkConf().setAppName("MyFileProcessingApp")
        sc = SparkContext(conf=conf)

        # Point the s3a:// scheme at the S3A connector and hand it the credentials.
        # This only needs to happen once, before the first read, not inside the loop.
        hadoop_conf = sc._jsc.hadoopConfiguration()
        hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        hadoop_conf.set("fs.s3a.access.key", ACCESS_KEY)
        hadoop_conf.set("fs.s3a.secret.key", SECRET_KEY)

        # Read up to MAX_FILES_READ gzip objects, one RDD per file, and count their lines
        index = 0
        for s3_file in my_bucket.objects.filter(Prefix=PREFIX):
                if s3_file.key.endswith('.gz'):
                        index += 1
                        if index > MAX_FILES_READ:
                                break
                        fileLocation = "s3a://" + BUCKET_NAME + '/' + s3_file.key
                        print("file location: ", fileLocation)
                        s3File = sc.textFile(fileLocation)
                        print(s3File.count())

Of course, Amazon offers EMR and Hadoop to make this easier, but I'm not there yet; I want to get a prototype working first. The code fails on the count at the end. The code also looks pretty rudimentary, so any other feedback is appreciated. Thanks in advance.
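In case it's relevant: my understanding is that a plain Spark download on EC2 does not ship the S3A connector (hadoop-aws plus the matching AWS SDK), so the count may be failing simply because the s3a:// filesystem class can't be loaded. Below is a rough, untested sketch of what I think the setup would look like if the connector is pulled in when the context is created and all matching keys are read into a single RDD. The package versions are placeholders that would have to match the Hadoop build bundled with my Spark install; ACCESS_KEY, SECRET_KEY, BUCKET_NAME and PREFIX are the same placeholders as in the code above.

from boto3.session import Session
from pyspark import SparkConf, SparkContext

# Sketch only: pull the S3A connector at startup instead of assuming it is
# pre-installed. The versions below are placeholders and must match the
# Hadoop version bundled with the local Spark build.
conf = (
    SparkConf()
    .setAppName("MyFileProcessingApp")
    .set("spark.jars.packages",
         "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4")
    # The spark.hadoop.* prefix copies these keys into the Hadoop configuration.
    .set("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
    .set("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
)
sc = SparkContext(conf=conf)

# Use boto3 only to discover the keys, then read them all as one RDD by
# joining the s3a:// paths with commas (textFile accepts a comma-separated list).
session = Session(aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
my_bucket = session.resource('s3').Bucket(BUCKET_NAME)
paths = ",".join(
    "s3a://" + BUCKET_NAME + "/" + obj.key
    for obj in my_bucket.objects.filter(Prefix=PREFIX)
    if obj.key.endswith(".gz")
)
rdd = sc.textFile(paths)
print(rdd.count())

My understanding is that the same dependencies can also be supplied on the command line via spark-submit --packages instead of setting spark.jars.packages in the conf.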

0 answers:

No answers yet.