Question

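I'm building a chatbot database at the moment, using data from pushshift.io. To cope with the large data files (I know json loads everything into RAM, so with only 16GB of RAM and 30GB of data that's a no-no), I wrote a bash script that splits the big file into smaller 3GB files so that I can run them through json.loads (or pd.read_json). The problem is that whenever I run my code, it returns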

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

So I took a look at the temp JSON file I had just created, and saw that this is what happened in my JSON file:

ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}

A correct sample of the data looks like this:

{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}

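I noticed that my bash script split the file without paying attention to where JSON objects begin and end. So my question is: is there a way to write a function in Python that can detect improperly formatted JSON objects and remove them?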

Answer 1

There isn't much information to go on, but I'd challenge the framing a little.

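There are several incremental JSON parsers available for Python. A quick search shows that ijson should let you iterate over very large data structures without blowing up your memory.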
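To illustrate the streaming idea, here is a minimal sketch that uses only the standard library rather than ijson. It assumes the file holds one JSON object per line, as the sample in the question suggests; if that holds, the file can be read lazily and never needs to be split. The name iter_records and the second sample record are hypothetical.

```python
import io
import json

def iter_records(lines):
    """Lazily yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for an open file; a real file object is iterated line by line
# in the same way, so only one line is held in memory at a time.
sample = io.StringIO(
    '{"id":"cnas8zv","subreddit":"exmormon","score":14}\n'
    '{"id":"hypothetical1","subreddit":"exmormon","score":3}\n'  # made-up record
)

total_score = sum(record["score"] for record in iter_records(sample))
print(total_score)
```

Because iter_records is a generator, memory use depends on the longest line rather than on the size of the file.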

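You should also consider a different data format (or a real database), otherwise you'll find yourself spending time reimplementing much slower versions of features you would already have with the right tool.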
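As one possible reading of "a real database", the sketch below loads parsed records into SQLite through the standard sqlite3 module. The table name, the choice of columns, and the second record are assumptions made for illustration; the first record is trimmed from the sample in the question.

```python
import sqlite3

records = [
    # Trimmed from the sample in the question.
    {"id": "cnas8zv", "subreddit": "exmormon", "score": 14,
     "body": "Most of us have some family members like this."},
    # Made-up record, only here to make the aggregate non-trivial.
    {"id": "hypothetical1", "subreddit": "exmormon", "score": 3,
     "body": "placeholder"},
]

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
conn.execute(
    "CREATE TABLE comments ("
    "id TEXT PRIMARY KEY, subreddit TEXT, score INTEGER, body TEXT)"
)
# Named placeholders are filled from each dict; OR IGNORE skips duplicate ids.
conn.executemany(
    "INSERT OR IGNORE INTO comments VALUES (:id, :subreddit, :score, :body)",
    records,
)
conn.commit()

rows = conn.execute(
    "SELECT subreddit, COUNT(*), SUM(score) FROM comments GROUP BY subreddit"
).fetchall()
print(rows)
```

Once the data is in a table, filtering and aggregation run inside the database instead of in Python over a 30GB file.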

Answer 2

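If you are using the json standard library, calling json.loads on malformed data raises a JSONDecodeError. You can wrap your code in a try/except statement and catch this exception to make sure you only process properly formatted data.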
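A minimal sketch of that try/except approach, assuming each line of a chunk is meant to be one JSON object. The function name keep_valid is hypothetical, and the sample lines are shortened versions of the record shown in the question.

```python
import json

def keep_valid(lines):
    """Return (records, dropped): the parsed objects and the number of
    non-empty lines that failed to parse."""
    records, dropped = [], 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            dropped += 1  # malformed line, e.g. cut in half by the splitter
    return records, dropped

chunk = [
    'ink_id":"t3_2qyr1a","score":14,"edited":false}',     # head cut off by the split
    '{"link_id":"t3_2qyr1a","score":14,"edited":false}',  # complete object
    '{"link_id":"t3_2qyr1a","sco',                        # tail cut off by the split
]

records, dropped = keep_valid(chunk)
print(len(records), dropped)
```

Note that this throws away the two halves of any object that was split across chunk files; splitting on line boundaries in the first place would avoid losing those records.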

Removing improperly formatted JSON objects in Python
