I'm trying to import a large zipped JSON file from Amazon S3 into AWS RDS PostgreSQL using Python, but I'm running into this error:
Traceback (most recent call last):
  File "my_code.py", line 64, in <module>
    file_content = f.read().decode('utf-8').splitlines(True)
  File "/usr/lib64/python3.6/zipfile.py", line 835, in read
    buf += self._read1(self.MAX_N)
  File "/usr/lib64/python3.6/zipfile.py", line 925, in _read1
    data = self._decompressor.decompress(data, n)
MemoryError
# my_code.py
import sys
import boto3
import psycopg2
import zipfile
import io
import json
import config

s3 = boto3.client('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)
connection = psycopg2.connect(host=<host>, dbname=<dbname>, user=<user>, password=<password>)
cursor = connection.cursor()

bucket = sys.argv[1]
key = sys.argv[2]
obj = s3.get_object(Bucket=bucket, Key=key)

def insert_query():
    query = """
        INSERT INTO data_table
        SELECT
            (src.test->>'url')::varchar, (src.test->>'id')::bigint,
            (src.test->>'external_id')::bigint, (src.test->>'via')::jsonb
        FROM (SELECT CAST(%s AS JSONB) AS test) src
    """
    cursor.execute(query, (json.dumps(data),))

if key.endswith('.zip'):
    zip_files = obj['Body'].read()
    with io.BytesIO(zip_files) as zf:
        zf.seek(0)
        with zipfile.ZipFile(zf, mode='r') as z:
            for filename in z.namelist():
                with z.open(filename) as f:
                    file_content = f.read().decode('utf-8').splitlines(True)
                    for row in file_content:
                        data = json.loads(row)
                        insert_query()

if key.endswith('.json'):
    file_content = obj['Body'].read().decode('utf-8').splitlines(True)
    for row in file_content:
        data = json.loads(row)
        insert_query()

connection.commit()
connection.close()
Is there a solution to this problem? Any help is welcome, thanks a lot!
Answer 0 (score: 1)
The problem is that you are trying to read the entire file into memory at once, which can exhaust memory if the file really is too large.

You should read the file one line at a time, and since each line in the file is apparently a JSON string, you can process each line directly inside the loop:
with z.open(filename) as f:
    for line in f:
        insert_query(json.loads(line.decode('utf-8')))
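The same line-by-line idea can be applied to the plain `.json` branch as well, so that the uncompressed object is never pulled into memory with a single read(). A minimal sketch, assuming obj['Body'] is a botocore StreamingBody that provides iter_lines() (available in recent boto3/botocore versions):

# Sketch only: stream the uncompressed .json object line by line instead of
# calling read() on the whole body. Assumes obj['Body'] supports iter_lines().
if key.endswith('.json'):
    for line in obj['Body'].iter_lines():
        if line:  # skip empty lines
            insert_query(json.loads(line.decode('utf-8')))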
Your insert_query function should accept data as a parameter accordingly:
def insert_query(data):
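For completeness, a minimal sketch of what the adjusted function could look like, reusing the query and the module-level cursor from the question's own code (this body is illustrative, not part of the original answer):

# Sketch only: same INSERT as in the question, but data is passed in
# explicitly instead of being read from a global variable.
def insert_query(data):
    query = """
        INSERT INTO data_table
        SELECT
            (src.test->>'url')::varchar, (src.test->>'id')::bigint,
            (src.test->>'external_id')::bigint, (src.test->>'via')::jsonb
        FROM (SELECT CAST(%s AS JSONB) AS test) src
    """
    cursor.execute(query, (json.dumps(data),))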