Question

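I'm trying to read a large JSON file (about 3 GB) with Python. The file actually contains about 7 million JSON objects (one per line).

I have tried many different solutions, but I keep running into the same error: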

json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 25 (char 24)

The code I'm using is here:

import json
import pandas as pd

with open('mydata.json') as json_file:
    data = json_file.readlines()
    # this line below may take at least 8-10 minutes of processing for 4-5
    # million rows. It converts all strings in list to actual json objects.
    data = list(map(json.loads, data))

pd.DataFrame(data)

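Any ideas why I'm getting this error? It seems to be related to the format of the file, but in principle it is valid JSON (I checked a few lines with https://jsonformatter.curiousconcept.com/).

I also tried reading a short version of the file (only about 30 lines), and that worked.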

Answer 1

A slightly cleaned-up Python 3 version of BoboDarph's code:

import json
import logging
import pandas as pd

logger = logging.getLogger(__name__)

def iter_good_json_lines(lines):
    for lineno, line in enumerate(lines, 1):
        try:
            yield json.loads(line.strip())
        except json.JSONDecodeError as err:
            logger.warning(f"lineno {lineno}:{err.colno} {err.msg}: {err.doc}")

with open('mydata.json') as fd:
    data = pd.DataFrame(iter_good_json_lines(fd))

data

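This version changes the following:

Iterating over an open file gives you an iterator that yields lines
It uses the logging module so that errors don't end up on stdout
Pandas >= 0.13 allows passing a generator to the DataFrame constructor
f-strings!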
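
Because the generator only needs an iterable of strings, it can be tried on a few in-memory lines before running it on the full file. The sketch below repeats the function so that it is self-contained; the sample lines are hypothetical, and `logging.basicConfig` is called only to make the warnings visible (on stderr by default).

```python
import json
import logging

logging.basicConfig(level=logging.WARNING)  # warnings go to stderr by default
logger = logging.getLogger(__name__)


def iter_good_json_lines(lines):
    for lineno, line in enumerate(lines, 1):
        try:
            yield json.loads(line.strip())
        except json.JSONDecodeError as err:
            logger.warning(f"lineno {lineno}:{err.colno} {err.msg}: {err.doc}")


# Hypothetical sample data: the middle line is missing a comma.
sample_lines = [
    '{"id": 1, "name": "a"}\n',
    '{"id": 2 "name": "b"}\n',
    '{"id": 3, "name": "c"}\n',
]

good = list(iter_good_json_lines(sample_lines))
print(len(good))  # the malformed line is logged and skipped
```

If the number of parsed records is lower than the number of lines, the logged warnings show which lines to inspect.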

Answer 2

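To elaborate on the comments above: one or more lines in your data file are most likely not valid JSON, so Python raises an error when it tries to load the string as a JSON object.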

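Depending on your needs, you can either let the code fail, because you rely on every line of the file being JSON and want to know when one is not (as happens now), or skip the non-JSON lines entirely and have your code emit a warning whenever it encounters one.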
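
The first option, failing loudly, becomes more useful if the error says which line of the file is at fault, since the default message always reports "line 1" when each line is parsed separately. A minimal sketch, assuming one JSON document per line; the helper name `load_strict` and the sample input are hypothetical, made up for illustration:

```python
import io
import json


def load_strict(lines):
    """Parse every line as JSON; stop at the first bad line and say where it is."""
    records = []
    for lineno, line in enumerate(lines, 1):
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            # Re-raise with the line number within the file, which the
            # decoder's own message cannot know about.
            raise ValueError(
                f"file line {lineno}, column {err.colno}: {err.msg}"
            ) from err
    return records


# Hypothetical sample: the second line is missing a comma.
sample = io.StringIO('{"a": 1, "b": 2}\n{"a": 1 "b": 2}\n')
try:
    load_strict(sample)
except ValueError as exc:
    message = str(exc)
    print(message)
```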

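To implement the second solution, wrap the string-to-JSON conversion in a try block to weed out all the offending lines. If you do that, every line that is not JSON will be ignored and your code will carry on trying to parse all the other lines.

Here is how I would implement it: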

import json
from json import JSONDecodeError
import pandas as pd
data = []
with open('mydata.json') as json_file:
    for line in json_file:  # iterate lazily instead of loading every line first
        js = None
        try:
            js = json.loads(line)
        except JSONDecodeError:
            print('Skipping line %s' %(line))
        if js:
            #You don't want None value in your dataframe
            data.append(js)
test = pd.DataFrame(data)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(test)
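
With millions of lines, printing every skipped line can get noisy. One possible variation, sketched here with the standard library only, records the line numbers of the skipped lines so they can be inspected afterwards. The function name `parse_json_lines` and the sample input are hypothetical.

```python
import io
import json


def parse_json_lines(lines):
    """Return (records, skipped), where skipped lists 1-based line numbers."""
    records, skipped = [], []
    for lineno, line in enumerate(lines, 1):
        if not line.strip():
            continue  # ignore blank lines rather than reporting them
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            skipped.append(lineno)
    return records, skipped


# Hypothetical input: line 2 is truncated, line 3 is blank.
sample = io.StringIO('{"a": 1}\n{"a": \n\n{"a": 2}\n')
records, skipped = parse_json_lines(sample)
print(len(records), skipped)
```

The resulting `records` list can then be passed to `pd.DataFrame` in the same way as `data` above.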

Error reading a large JSON file with Python: "json.decoder.JSONDecodeError: Expecting ',' delimiter"
