How to read a large JSON file from Amazon S3 using Boto3

Asked: 2018-08-01 00:36:39

Tags: json amazon-s3 etl boto3

I am trying to read a JSON file from Amazon S3, and its file size is about 2 GB. When I use the .read() method, it gives me a MemoryError.

Is there a solution to this problem? Any help is appreciated, thanks a lot!
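A minimal sketch of the pattern described above (the bucket and key names are placeholders, not from the original question):

import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='big-file.json')

# .read() buffers the entire ~2 GB body in memory at once,
# which is what triggers the MemoryError.
data = obj['Body'].read()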

3 Answers:

Answer 0 (score: 2):

So I found a way that worked for me efficiently. I had a 1.60 GB file that I needed to load for processing.

import io
import json

import boto3

# .Object() belongs to the resource interface, so use boto3.resource here
# (the original snippet created a client, which has no .Object() method).
s3 = boto3.resource('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)

# Collect the whole object body as bytes.
data_in_bytes = s3.Object(bucket_name, filename).get()['Body'].read()

# Decode the bytes as UTF-8 text.
decoded_data = data_in_bytes.decode('utf-8')

# Wrap the text in a StringIO object so it can be read like a file.
stringio_data = io.StringIO(decoded_data)

# Read the StringIO object line by line.
data = stringio_data.readlines()

# Parse each line with the json module.
json_data = list(map(json.loads, data))

So json_data holds the contents of the file. I know there are a lot of intermediate variables, but it worked for me.
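Note that this approach holds several full copies of the data in memory at once (the raw bytes, the decoded string, and the list of lines). A slightly leaner sketch of the same idea, not from the original answer, parses each line as it is read instead of materializing the readlines() list:

# Assumes decoded_data from the snippet above; skips blank lines.
json_data = [json.loads(line) for line in io.StringIO(decoded_data) if line.strip()]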

Answer 1 (score: 1):

Just iterate over the object.

import json

import boto3

s3 = boto3.client('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)
fileObj = s3.get_object(Bucket='bucket_name', Key='key')

# The response key is 'Body' (capitalized, not 'body'); iter_lines()
# streams it line by line, yielding each line as raw bytes.
for row in fileObj['Body'].iter_lines():
    line = row.decode('utf-8')
    print(json.loads(line))

Answer 2 (score: 0):

I just solved this problem. Here is the code. Hope it helps someone in the future!

import json

import boto3

s3 = boto3.client('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)
obj = s3.get_object(Bucket='bucket_name', Key='key')

# Lazily decode each streamed line; the generator never holds the whole file.
file_content = (line.decode('utf-8') for line in obj['Body'].iter_lines())
for row in file_content:
    print(json.loads(row))
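This scales because iter_lines() pulls the streaming body down in small chunks and splits them on newlines, so peak memory is bounded by the chunk size plus the longest line, not by the 2 GB file. In the botocore versions I have checked, iter_lines() also accepts a chunk_size argument if you want larger reads per network call (a one-line variant of the generator above):

file_content = (line.decode('utf-8') for line in obj['Body'].iter_lines(chunk_size=1024 * 1024))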
