Using Python 3.x, I need to extract JSON objects from a large file (> 5 GB) that I read as a stream. The file is stored on S3, and I don't want to load the whole file into memory to process it, so I read chunks of amt = 10000 (or some other chunk size).
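For reference, the s3ReadObj handle passed to the generator below is obtained along these lines (the bucket and key names here are placeholders, not my real ones):

import boto3

s3 = boto3.client("s3")
# Placeholder bucket/key names
s3ReadObj = s3.get_object(Bucket="my-bucket", Key="big-file.json")
# s3ReadObj["Body"] is a botocore StreamingBody; .read(amt=n) returns
# at most n bytes, and b"" once the stream is exhausted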
The data is in this format:
{
object-content
}{
object-content
}{
object-content
}
...and so on.
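Concretely, with made-up keys, the stream is back-to-back objects with no separator between them:

{"id": 1, "payload": "first"}{"id": 2, "payload": "second"}{"id": 3, "payload": "third"}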
To tackle this, I have tried a few approaches, but the only working solution I have is reading the chunks one by one and scanning for "}". For every "}", I try to convert the slice of text inside a moving window of indexes to JSON with json.loads(). If that fails, I pass and move to the next "}"; if it succeeds, I yield the object and advance the window index.
import json
import re

def streamS3File(s3objGet):
    chunk = ""
    indexStart = 0  # start of the moving text window where a JSON object begins
    indexStop = 0   # end of the moving text window where a JSON object ends
    while True:
        # Get a new chunk of data
        # NOTE: assumes a chunk boundary never splits a multi-byte UTF-8
        # character; otherwise decode() can raise UnicodeDecodeError here
        newChunk = s3objGet["Body"].read(amt=100000).decode("utf-8")
        # An empty read means we are at the end of the file
        if len(newChunk) == 0:
            return  # PEP 479: end a generator with return, not StopIteration
        # Add the new data to the leftover from the last chunk
        chunk = chunk + newChunk
        # Look for "}". For every "}", try to convert the part of the chunk
        # up to it into JSON. If that fails, move on to the next "}".
        for m in re.finditer(r"[{}]", chunk):
            if m.group(0) == "}":
                try:
                    indexStop = m.end()
                    yield json.loads(chunk[indexStart:indexStop])
                    indexStart = indexStop
                except json.JSONDecodeError:
                    pass
        # Drop the part of the chunk already processed and yielded as objects
        chunk = chunk[indexStart:]
        # Reset the window indexes
        indexStart = 0
        indexStop = 0
for t in streamS3File(s3ReadObj):
    # t is the json-object found
    # do something with it here
    pass
I would like input on other ways to accomplish this task: finding JSON objects in a text stream and extracting them as they pass by.
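For comparison, one direction I have been looking at (a rough sketch only, not tested against the real file; the function name is mine) uses the standard library's json.JSONDecoder.raw_decode, which parses a single JSON document from a string and returns the index just past it, so there is no need to scan for "}" and trial-parse:

import json

def stream_json_objects(s3objGet, amt=100000):
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        newChunk = s3objGet["Body"].read(amt=amt).decode("utf-8")
        if not newChunk:
            break
        buffer += newChunk
        pos = 0
        while pos < len(buffer):
            # Skip any whitespace between concatenated objects
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos == len(buffer):
                break
            try:
                # raw_decode returns the parsed object and the index just
                # past it, so the next iteration resumes from there
                obj, pos = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                # The object at the end of the buffer is incomplete;
                # keep the remainder and read more data
                break
            yield obj
        buffer = buffer[pos:]

This would avoid re-parsing the same prefix for every "}" that turns out not to end an object, since the decoder tracks string boundaries and nesting itself.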