I'm writing a Python chatbot with TensorFlow that uses the dump of every publicly available Reddit comment from the past several years, posted at https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/?st=j9udbxta&sh=69e4fee7. I downloaded the comments via torrent and everything seemed to go fine. However, when I read one of the JSON files into my Python program, the whole file doesn't appear to load. Each month of 2015 is about 15,000 KB of data, but only the first ~2,600 lines of the JSON load, while the real file contains hundreds of thousands of lines. When I look at the last line that does load from the JSON file, it appears to be cut off mid-sentence, like this:
{"subreddit":"sydney","author_flair_text":null,"id":"cqugtij","gilded":0,"removal_reason":null,"downs":0,"archived":false,"created_utc":"1430439358","link_id":"t3_34e5fd","ups":6,"subreddit_id":"t5_2qkob","name":"t1_cqugtij","score_hidden":false,"author_flair_css_class":null,"parent_id":"t1_cqttsc3","controversiality":0,"score":6,"author":"SilverMeteor9798","body":"As state transport minister almost every press release from Gladys had something in there about how the liberals were \"getting on with the job\" and blaming Labor for something. It wasn't necessarily false, it just got tiresome after a while particular
Here is the code I'm using to read the JSON file:
import json

timeframe = '2015-05'

# builds e.g. "Data/reddit_data/2015/RC_2015-05"
with open("Data/reddit_data/{}/RC_{}".format(timeframe.split('-')[0], timeframe), buffering=1000) as f:
    for row in f:  # the dump is one JSON object (one comment) per line
        row = json.loads(row)
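To show where it stops, here is a minimal, self-contained version of the same loop (with the path for 2015-05 written out) that counts how many rows parse before the failure; this is a sketch of the kind of check behind the ~2,600-line figure above:

import json

count = 0
try:
    with open("Data/reddit_data/2015/RC_2015-05") as f:
        for row in f:
            json.loads(row)  # raises JSONDecodeError on the truncated row
            count += 1
except json.JSONDecodeError as e:
    print("parsed {} rows before the error: {}".format(count, e))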
Here timeframe picks out the specific JSON file for the Reddit comments from May 2015. When I run this code, I get this error:
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 368 (char 367)
This makes sense to me, since the last line loaded from the JSON file is cut short, but how do I get Python to read the entire JSON file? I'm following sentdex's chatbot tutorial on YouTube (https://www.youtube.com/watch?v=dvOnYLDg8_Y), and I get the same error even when I run his exact code. How can I load the entire JSON file so I can read the hundreds of thousands of comments? I've tried changing the buffering, and I've tried re-downloading the comments.
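For completeness, the buffering changes I tried were along these lines (values approximate, from memory; all of them hit the same error):

path = "Data/reddit_data/2015/RC_2015-05"
open(path)                    # default buffering
open(path, buffering=1)       # line buffering in text mode
open(path, buffering=2**16)   # a larger fixed buffer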