Question

I have searched Google everywhere and cannot find a solution to this problem. I keep getting the following error:

JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)

The error occurs on the line row = json.loads(row) in my Python file. The JSON file contains part of the Reddit comments from 2015-05:

JSON (learn\learning_data\2015\RC_2015-05):

{
  "created_utc": "1430438400",
  "ups": 4,
  "subreddit_id": "t5_378oi",
  "link_id": "t3_34di91",
  "name": "t1_cqug90g",
  "score_hidden": false,
  "author_flair_css_class": null,
  "author_flair_text": null,
  "subreddit": "soccer_jp",
  "id": "cqug90g",
  "removal_reason": null,
  "gilded": 0,
  "downs": 0,
  "archived": false,
  "author": "rx109",
  "score": 4,
  "retrieved_on": 1432703079,
  "body": "\u304f\u305d\n\u8aad\u307f\u305f\u3044\u304c\u8cb7\u3063\u305f\u3089\u8ca0\u3051\u306a\u6c17\u304c\u3059\u308b\n\u56f3\u66f8\u9928\u306b\u51fa\u306d\u30fc\u304b\u306a",
  "distinguished": null,
  "edited": false,
  "controversiality": 0,
  "parent_id": "t3_34di91"
}

* The JSON data shown is only a small part of what I actually have, and I cannot change the format. For example:

{
  "text": "data",
  "text": "data"
}
{
  "text2": "data",
  "text2": "data"
}
{
  "text3": "data",
  "text3": "data"
}
etc...

Python (learn\main.py):

with open("learning_data/{}/RC_{}".format(timeframe.split('-')[0], timeframe), buffering=1000) as f:
    for row in f:
        row_counter += 1
        row = json.loads(row)
        body = format_data(row['body'])
        created_utc = row['created_utc']
        parent_id = row['parent_id']
        comment_id = row['name']
        score = row['score']
        subreddit = row['subreddit']
        parent_data = find_parent(parent_id)

        if score >= 2:
            if acceptable(body):
                existing_comment_score = find_existing_score(parent_id)

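The JSON file already has double quotes around everything that needs them. If some other error is causing this, JSONLint.com claims the JSON is free of it.

I have been following the code from this tutorial, where the tutorial's code works without any errors (according to the accompanying video; using the code from the link above, I still get the error). Because the tutorial uses Python 3.5, I downgraded my Python version, and I keep getting the same error.

If the JSON already uses double quotes and JSONLint says it is valid, what is causing this error?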

Answer 1

Your JSON has newlines in it.

But your code reads one line at a time and expects each line to be a complete JSON text:

for row in f:
    row_counter += 1
    row = json.loads(row)

That is not going to work.
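A minimal reproduction of the failure, assuming a made-up two-field record in place of the real Reddit data: the first line a `for row in f` loop yields from a pretty-printed file is just an opening brace plus a newline, and that alone is not a JSON text. The helper name `try_parse` is hypothetical.

```python
import json

# Hypothetical record, pretty-printed across several lines like the Reddit dump.
pretty = '{\n  "score": 4,\n  "body": "hi"\n}\n'

# The first item a line-by-line loop would see: '{\n'
first_line = pretty.splitlines(keepends=True)[0]

def try_parse(text):
    """Return the parsed value, or the decoder's error message on failure."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return exc.msg

line_result = try_parse(first_line)   # fails: a lone '{' is incomplete
whole_result = try_parse(pretty)      # succeeds: the whole object is valid
```

This is why JSONLint accepts the text while the loop rejects it: the validator sees the whole object, and the loop sees only fragments.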

If your file is just one JSON text, read the whole thing:

with open("learning_data/{}/RC_{}".format(timeframe.split('-')[0], timeframe), buffering=1000) as f:
    row_counter += 1
    row = json.load(f)

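You may want to rename row to something more meaningful, such as contents.

If your file is a sequence of JSON texts and you generate the file yourself, the right thing to do is change how you generate it. A stream of arbitrary JSON texts is not really a valid format. But if you really want to build a format on top of it, you can: for example, escape all newlines so that you can parse it line by line. Or you could use a real format. Or you could write out one big JSON array instead of a bunch of separate JSON texts.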
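For the case where you control the writer, here is a rough sketch of the one-text-per-line idea. The records are invented for illustration, and an in-memory `io.StringIO` stands in for a real file. It relies on `json.dumps` without `indent` emitting no raw newlines, since a newline inside a string is written as the two characters `\n`.

```python
import io
import json

# Hypothetical records; the second body contains a newline character.
records = [
    {"id": "a1", "body": "first"},
    {"id": "b2", "body": "line one\nline two"},
]

# Writing: one compact JSON text per physical line.
buffer = io.StringIO()  # stands in for a file opened for writing
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Reading: a line-by-line loop is now valid, because every line
# is a complete JSON text on its own.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
```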

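If you cannot change the file, you need a strategy for parsing it. All of these are almost correct:

Use the raw_decode method of the json module's JSONDecoder to read the next JSON text; it returns the decoded value plus the offset where the next one starts.
Balance braces and brackets, and split each time the count drops back to 0.
Scan for newlines, then backtrack to check whether any brackets or braces are still open.

Apart from poor error handling, the only serious problem with any of these is that they cannot do the right thing with numbers as top-level texts. If your top-level texts are all objects, that is not a problem.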
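The brace-balancing strategy could be sketched roughly as follows. The function name `split_json_objects` and the sample text are hypothetical, and error handling is deliberately minimal. The one subtlety is that braces inside string literals must not be counted, so the scanner also tracks string state and backslash escapes. As noted above, bare top-level numbers are not handled; this sketch silently skips them.

```python
import json

def split_json_objects(text):
    """Yield each top-level {...} or [...] chunk of `text` as a substring."""
    depth = 0          # current nesting level of braces/brackets
    start = None       # index where the current top-level chunk began
    in_string = False  # True while inside a "..." literal
    escaped = False    # True right after a backslash inside a string
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            if depth == 0:
                start = i
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]

# Two pretty-printed objects back to back, with braces hidden inside strings.
sample = '{\n "a": "x}y",\n "b": [1, 2]\n}\n{\n "a": "q\\"{"\n}\n'
rows = [json.loads(chunk) for chunk in split_json_objects(sample)]
```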

So:

with open("learning_data/{}/RC_{}".format(timeframe.split('-')[0], timeframe), buffering=1000) as f:
    contents = f.read()
    decoder = json.JSONDecoder()
    while contents:
        row, idx = decoder.raw_decode(contents)
        row_counter += 1
        contents = contents[idx:].lstrip()
        # etc.

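If your file is large, though, you almost certainly want to mmap it and pass slices or memoryviews to raw_decode; or, if that does not work because you have Unicode text, you may need to buffer chunks manually. Not exactly trivial, but you are parsing a broken format, so this is easier than you might think. :)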
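A possible variation on the loop above is to keep the text intact and advance an offset, which avoids copying the remainder of the string on every iteration. This is only a sketch: the generator name `iter_json_texts` is hypothetical, and it still assumes the whole file fits in memory as one `str`, so it is not the mmap or chunk-buffering approach. It also skips whitespace explicitly before each call, because `raw_decode` expects the JSON text to begin exactly at the given index.

```python
import json

def iter_json_texts(text):
    """Yield each JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    idx = 0
    end = len(text)
    while True:
        # Skip whitespace between (and before) JSON texts.
        # str.isspace() is slightly broader than JSON's whitespace set.
        while idx < end and text[idx].isspace():
            idx += 1
        if idx >= end:
            return
        # raw_decode returns the value and the index just past it.
        value, idx = decoder.raw_decode(text, idx)
        yield value

# Hypothetical input: leading whitespace and two concatenated objects.
sample = ' {"a": 1}\n{\n "b": [2, 3]\n}\n\n'
values = list(iter_json_texts(sample))
```

With the file from the question, `iter_json_texts(f.read())` would take the place of the `while contents:` loop, with `row_counter += 1` inside the `for` loop that consumes it.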

Answer 2

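A stream of JSON documents, one per line, is a format called JSONL. This differs from "JSON", which allows only one document per file.

You can easily convert your file to this format by running jq -c . <in.json >out.json. jq is a command-line tool for processing JSON and JSONL documents; the -c flag enables "compact" mode, which puts each document on a single line of output.

Even simpler, you can do it inline and let your Python code iterate directly over jq's output: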

import json
import subprocess

with open("learning_data/{}/RC_{}".format(timeframe.split('-')[0], timeframe)) as f:
    # jq reads the file on stdin and writes one compact document per line
    p = subprocess.Popen(['jq', '-c', '.'], stdin=f, stdout=subprocess.PIPE)
    for line in p.stdout:
        content = json.loads(line)
        # ...process your line's content here.

JSONDecodeError "Expecting property name enclosed in double quotes" from a file with multiple JSON documents
