Using Python 3.x, I need to extract JSON objects from a large file (> 5 GB) that I read as a stream. The file is stored on S3, and I don't want to load the whole file into memory to process it, so I read chunks of amt = 10000 (or some other chunk size).
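For reference, the s3ReadObj handle passed to the generator below is obtained along these lines (the bucket and key names here are placeholders, not my real ones):

import boto3

s3 = boto3.client("s3")
# Placeholder bucket/key names
s3ReadObj = s3.get_object(Bucket="my-bucket", Key="big-file.json")
# s3ReadObj["Body"] is a botocore StreamingBody; .read(amt=n) returns
# at most n bytes, and b"" once the stream is exhausted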
The data is in this format:
{
object-content
}{
object-content
}{
object-content
}
...and so on.
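Concretely, with made-up keys, the stream is back-to-back objects with no separator between them:

{"id": 1, "payload": "first"}{"id": 2, "payload": "second"}{"id": 3, "payload": "third"}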
To tackle this, I have tried a few approaches, but the only working solution I have is reading the chunks one by one and scanning for "}". For every "}", I try to convert the slice of text inside a moving window of indexes to JSON with json.loads(). If that fails, I pass and move to the next "}"; if it succeeds, I yield the object and advance the window index.
import json
import re

def streamS3File(s3objGet):
    chunk = ""
    indexStart = 0  # start of the moving text window where a JSON object begins
    indexStop = 0   # end of the moving text window where a JSON object ends
    while True:
        # Get a new chunk of data
        # NOTE: assumes a chunk boundary never splits a multi-byte UTF-8
        # character; otherwise decode() can raise UnicodeDecodeError here
        newChunk = s3objGet["Body"].read(amt=100000).decode("utf-8")
        # An empty read means we are at the end of the file
        if len(newChunk) == 0:
            return  # PEP 479: end a generator with return, not StopIteration
        # Add the new data to the leftover from the last chunk
        chunk = chunk + newChunk
        # Look for "}". For every "}", try to convert the part of the chunk
        # up to it into JSON. If that fails, move on to the next "}".
        for m in re.finditer(r"[{}]", chunk):
            if m.group(0) == "}":
                try:
                    indexStop = m.end()
                    yield json.loads(chunk[indexStart:indexStop])
                    indexStart = indexStop
                except json.JSONDecodeError:
                    pass
        # Drop the part of the chunk already processed and yielded as objects
        chunk = chunk[indexStart:]
        # Reset the window indexes
        indexStart = 0
        indexStop = 0
for t in streamS3File(s3ReadObj):
    # t is the json-object found
    # do something with it here
    pass
I would like input on other ways to accomplish this task: finding JSON objects in a text stream and extracting them as they pass by.
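For comparison, one direction I have been looking at (a rough sketch only, not tested against the real file; the function name is mine) uses the standard library's json.JSONDecoder.raw_decode, which parses a single JSON document from a string and returns the index just past it, so there is no need to scan for "}" and trial-parse:

import json

def stream_json_objects(s3objGet, amt=100000):
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        newChunk = s3objGet["Body"].read(amt=amt).decode("utf-8")
        if not newChunk:
            break
        buffer += newChunk
        pos = 0
        while pos < len(buffer):
            # Skip any whitespace between concatenated objects
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos == len(buffer):
                break
            try:
                # raw_decode returns the parsed object and the index just
                # past it, so the next iteration resumes from there
                obj, pos = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                # The object at the end of the buffer is incomplete;
                # keep the remainder and read more data
                break
            yield obj
        buffer = buffer[pos:]

This would avoid re-parsing the same prefix for every "}" that turns out not to end an object, since the decoder tracks string boundaries and nesting itself.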