How do I extract JSON objects from a large file?

Date: 2018-03-23 11:53:24

Tags: python json

Using Python 3.x, I need to extract JSON objects from a large file (> 5 GB), reading it as a stream. The file is stored on S3, and I don't want to load the whole file into memory for processing, so I read it in chunks of amt = 10000 (or some other chunk size).
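The chunked-read pattern can be exercised locally without S3. In the sketch below, `io.BytesIO` stands in for the `s3objGet["Body"]` stream (a hypothetical substitution; boto3's `StreamingBody.read` accepts an `amt` argument in the same way). It shows the key complication: chunk boundaries fall anywhere, including mid-object.

```python
import io

# Stand-in for the S3 streaming body; boto3's StreamingBody
# likewise returns at most `amt` bytes per read() call.
body = io.BytesIO(b'{"a": 1}{"b": 2}{"c": 3}')

chunks = []
while True:
    chunk = body.read(10)  # read at most 10 bytes per call
    if not chunk:          # an empty read means end of stream
        break
    chunks.append(chunk)

# Chunk boundaries do not respect object boundaries:
print(chunks)  # [b'{"a": 1}{"', b'b": 2}{"c"', b': 3}']
```

Because a chunk can end in the middle of an object, any solution has to carry leftover text from one chunk into the next, which is what the code below does.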

The data is in this format:

{
object-content
}{
object-content
}{
object-content
}

...and so on.

To solve this, I tried a few approaches, but the only workable solution I found is to read the chunks one by one and scan for "}". For every "}", I try to convert the moving window of text up to that index to JSON with json.loads(). If that fails, I skip it and move on to the next "}". If it succeeds, I yield the object and advance the index.

import json
import re

def streamS3File(s3objGet):

    chunk = ""
    indexStart = 0  # start of the moving window of text where the next JSON object may begin

    while True:
        # Get a new chunk of data
        newChunk = s3objGet["Body"].read(amt=100000).decode("utf-8")
        # An empty read means we are at the end of the file
        if len(newChunk) == 0:
            return  # raising StopIteration inside a generator is an error (PEP 479)
        # Append to the leftover from the last chunk
        chunk = chunk + newChunk

        # Look for "}". For every "}", try to convert the window up to it
        # to JSON. If that fails, move on to the next "}".
        for m in re.finditer(r"\}", chunk):
            try:
                indexStop = m.end()
                yield json.loads(chunk[indexStart:indexStop])
                indexStart = indexStop
            except json.JSONDecodeError:
                pass
        # Drop the part of the chunk already processed and yielded as objects
        chunk = chunk[indexStart:]
        # Reset the index
        indexStart = 0

for t in streamS3File(s3ReadObj):
    # t is the JSON object found; do something with it here
    print(t)

I'd like to hear about other approaches to this task: finding JSON objects in a stream of text and extracting them as they pass by.
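One alternative worth considering is the standard library's `json.JSONDecoder.raw_decode`, which parses a single object from the front of a string and reports the index where it stopped, so there is no need to guess at "}" boundaries or swallow decode errors for partial windows. A minimal sketch, where the generator name and chunk size are illustrative and `io.StringIO` stands in for the decoded S3 stream:

```python
import io
import json

def iter_json_objects(stream, chunk_size=65536):
    """Yield JSON objects from a stream of concatenated objects."""
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        data = stream.read(chunk_size)
        if not data:
            break  # end of stream
        buf += data
        while buf:
            buf = buf.lstrip()  # raw_decode rejects leading whitespace
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # incomplete object at the buffer's tail; read more data
            yield obj
            buf = buf[end:]  # keep only the unparsed remainder

# Usage with an in-memory stream; a tiny chunk_size forces
# chunk boundaries to fall inside objects, as they would on S3.
stream = io.StringIO('{"a": 1}{"b": 2}{"c": 3}')
print(list(iter_json_objects(stream, chunk_size=7)))
# [{'a': 1}, {'b': 2}, {'c': 3}]
```

Compared with scanning for "}", this never attempts to parse a window that merely looks complete (e.g. a "}" inside a string value), and each object is decoded exactly once.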

0 Answers:

No answers