I am new to Spark and AWS S3. My goal is to read multiple gzip files from AWS S3 into an RDD, perform some transformations/actions on each file, and store the results in output files back on S3. I am not there yet: for now I have a single EC2 instance running Ubuntu 16.04 with Python, Spark, and boto3 installed, and I am struggling to get even a simple line count to work. Here is the code. Can someone help me figure out what I am doing wrong?
import os.path
from pathlib import Path
from pyspark import SparkContext, SparkConf
from boto3.session import Session
# Raj Tambe credentials
ACCESS_KEY = 'Blah Blah'
SECRET_KEY = 'Blah Blah'
BUCKET_NAME = 'bucketname'
PREFIX = 'foldername/'
MAX_FILES_READ = 3
if __name__ == "__main__":
    # Use boto3 to connect to S3 and get a list of objects from the bucket
    session = Session(aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
    s3 = session.resource('s3')

    # Get a handle to the bucket so its keys can be listed
    my_bucket = s3.Bucket(BUCKET_NAME)

    # Get a Spark context and use it to read the keys
    conf = SparkConf().setAppName("MyFileProcessingApp")
    sc = SparkContext(conf=conf)

    index = 0
    for s3_file in my_bucket.objects.filter(Prefix=PREFIX):
        if 'gz' in s3_file.key:
            index += 1
            if index == MAX_FILES_READ:
                break
            # Point the s3a:// scheme at the S3A filesystem and pass the credentials
            sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
            sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', ACCESS_KEY)
            sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', SECRET_KEY)
            fileLocation = "s3a://" + BUCKET_NAME + '/' + s3_file.key
            print("file location: ", fileLocation)
            # Read the gzip file into an RDD of lines and count them
            s3File = sc.textFile(fileLocation)
            print(s3File.count())
Of course, Amazon provides EMR and Hadoop to make this easier, but I am not there yet; I want to get it working on this prototype first. The code fails on the count() at the end. Also, this code probably looks very rudimentary, so any other feedback is appreciated. Thanks in advance.
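For context, here is a rough, untested sketch of what I eventually want the whole pipeline to look like once the basic read works: configure S3A once, let Spark read every gzip file under the prefix with a glob instead of looping over keys with boto3, and write the results back to S3. The transformation (upper-casing each line) and the output path are just placeholders.

from pyspark import SparkContext, SparkConf

ACCESS_KEY = 'Blah Blah'    # placeholder
SECRET_KEY = 'Blah Blah'    # placeholder
BUCKET_NAME = 'bucketname'  # placeholder
PREFIX = 'foldername/'      # placeholder

conf = SparkConf().setAppName("MyFileProcessingApp")
sc = SparkContext(conf=conf)

# Set the S3A credentials once, before any reads
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', ACCESS_KEY)
hadoop_conf.set('fs.s3a.secret.key', SECRET_KEY)

# Read all gzip files under the prefix into one RDD of lines;
# textFile decompresses .gz files transparently
lines = sc.textFile("s3a://" + BUCKET_NAME + "/" + PREFIX + "*.gz")

# Placeholder transformation: upper-case every line
transformed = lines.map(lambda line: line.upper())

# Write the results back to S3 (the output prefix must not already exist)
transformed.saveAsTextFile("s3a://" + BUCKET_NAME + "/output/")

I run spark-submit with --packages org.apache.hadoop:hadoop-aws:&lt;version matching the installed Hadoop&gt; so that the S3A filesystem classes and the AWS SDK are on the classpath; as far as I understand, without those jars any read over s3a:// fails, so that may be related to my count() problem.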