Question

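I've been following a chatbot tutorial and I'm stuck. I've put a link to the exact step I'm on at the bottom of this post, in case you're curious what my code looks like (I got frustrated, so I copied his code verbatim).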

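When I run my code, it processes more than 26,000 lines before throwing an exception. My code is below. As you can see, I've tried various solutions, including replacing the \r and \n characters with nothing and adding the strict=False flag, which I thought would let unterminated strings through the JSON parser, but that didn't work either.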

with open('C:/Python34/stuff/chatbot/{}/RC_{}'.format(timeframe.split('-')[0], timeframe), buffering=1000) as f:
    for row in f:
        row_counter += 1

        if row_counter > start_row:
            try:
                row = json.loads(row.replace('\n','').replace('\r',''), strict=False)

                # ---------blah blah blah blah------------

            except Exception as e:
                print("RUH ROH " + str(e))

The exact error message is as follows:

RUH ROH Unterminated string starting at: line 1 column 368 (char 367)

Link: https://pythonprogramming.net/building-database-chatbot-deep-learning-python-tensorflow/

EDIT

Getting rid of the try/except gives me a bit more information when the error is thrown, shown below:

Traceback (most recent call last):
  File "C:/Python34/stuff/chatbot/chatbot_db2.py", line 103, in <module>
    row = json.loads(row.replace('\n','').replace('\r',''), strict=False)
  File "C:\Python34\lib\json\__init__.py", line 331, in loads
    return cls(**kw).decode(s)
  File "C:\Python34\lib\json\decoder.py", line 343, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python34\lib\json\decoder.py", line 359, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Unterminated string starting at: line 1 column 368 (char 367)

EDIT2:

Following up on the comments, which suggested I print out the line on which the exception is thrown. It did shed some light.

{"subreddit":"sydney","author_flair_text":null,"id":"cqugtij","gilded":0,"removal_reason":null,"downs":0,"archived":false,"created_utc":"1430439358","link_id":"t3_34e5fd","ups":6,"subreddit_id":"t5_2qkob","name":"t1_cqugtij","score_hidden":false,"author_flair_css_class":null,"parent_id":"t1_cqttsc3","controversiality":0,"score":6,"author":"SilverMeteor9798","body":"As state transport minister almost every press release from Gladys had something in there about how the liberals were \"getting on with the job\" and blaming Labor for something. It wasn't necessarily false, it just got tiresome after a while particular

A line that parses successfully, by contrast, looks like this:

{"created_utc":"1430438400","ups":4,"subreddit_id":"t5_378oi","link_id":"t3_34di91","name":"t1_cqug90g","score_hidden":false,"author_flair_css_class":null,"author_flair_text":null,"subreddit":"soccer_jp","id":"cqug90g","removal_reason":null,"gilded":0,"downs":0,"archived":false,"author":"rx109","score":4,"retrieved_on":1432703079,"body":"\u304f\u305d\n\u8aad\u307f\u305f\u3044\u304c\u8cb7\u3063\u305f\u3089\u8ca0\u3051\u306a\u6c17\u304c\u3059\u308b\n\u56f3\u66f8\u9928\u306b\u51fa\u306d\u30fc\u304b\u306a","distinguished":null,"edited":false,"controversiality":0,"parent_id":"t3_34di91"}

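Honestly, I'm even more confused now, but it looks like all of the successful objects end with "}. So either this one doesn't terminate, or it contains a character that can't be parsed?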
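The two possibilities raised here can be told apart mechanically. The following is only a sketch: `diagnose` is a hypothetical helper, not part of the tutorial. It relies on the documented behaviour of `strict=False`, which permits raw control characters (such as tabs or newlines) inside strings and does nothing for a string that is never closed. It catches `ValueError`, which is what the `json` module raises on Python 3.4 and which is also the base class of `json.JSONDecodeError` on later versions.

```python
import json

def diagnose(line):
    """Classify one line of a JSON-lines file (hypothetical helper)."""
    try:
        json.loads(line)
        return "ok"
    except ValueError as first_error:
        try:
            # strict=False only relaxes the rule about raw control
            # characters inside strings; it cannot repair a cut-off string.
            json.loads(line, strict=False)
            return "control character inside a string"
        except ValueError:
            # Heuristic: a complete JSON object ends with a closing brace.
            if not line.rstrip().endswith("}"):
                return "truncated: " + str(first_error)
            return "other: " + str(first_error)
```

If a line is reported as truncated, no amount of `.replace()` or `strict=False` will make it parse, because the missing part of the object is simply not there.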

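EDIT3 - SOLVED

I assumed the file was complete, but something must have gone wrong when I downloaded it: the file was cut off, leaving an incomplete JSON object as the last entry. Deleting that entry solved the problem.

Thanks for all the help, everyone.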

Answer 1

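As I explained in EDIT2, I printed out the line that was giving me trouble and found that it did not end with }, as every JSON object should. I then went into the file and ran a simple search for that exact line, and found that not only was the line truncated, it was also the last line of my file.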
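The manual search described above can also be scripted. This is a minimal sketch under the assumption that the file holds one JSON object per line; `find_bad_lines` is a hypothetical name, and it accepts any iterable of lines, so an open file object works as well as a list.

```python
import json

def find_bad_lines(lines):
    """Return ([(line_number, error_message), ...], total_line_count)."""
    bad = []
    total = 0
    for total, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except ValueError as e:  # json.JSONDecodeError subclasses ValueError
            bad.append((total, str(e)))
    return bad, total
```

With a real file this would be called as `with open(path) as f: bad, total = find_bad_lines(f)`, where `path` stands for your own data file. If the only failing line number equals `total`, the file was most likely cut off at the end.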

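Something must have gone wrong when I downloaded or extracted the file, and it appears to have been cut short. That in turn caused the error I was getting, which none of the suggested solutions could fix.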

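For anyone who runs into this error and finds that the .replace() solutions don't work: try looking at your data and make sure there is actually something in it to replace or edit. In my case a truncation occurred during download or extraction, which made those solutions ineffective.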
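If re-downloading the file is not practical, the reading loop can be made to tolerate a broken final entry while still failing loudly on anything wrong in the middle of the file. The sketch below is one possible approach, not the tutorial's code; `iter_rows` is a hypothetical name, and it assumes one JSON object per line.

```python
import json

def iter_rows(lines):
    """Yield parsed rows, dropping only an unparseable final line."""
    previous = None
    for line in lines:
        if previous is not None:
            # Not the last line, so a parse error here is a real problem
            # and is allowed to propagate.
            yield json.loads(previous)
        previous = line
    if previous is not None:
        try:
            yield json.loads(previous)
        except ValueError:
            pass  # truncated last entry: skip it
```

Holding back one line at a time is what lets the function know which line is the last without reading the whole file into memory.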

Many thanks to abarnert, Michael Robellard and Anton Kachurin.

Answer 2

I found that the good people at Luminoso have written a library to deal with this kind of problem.

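Apparently, you sometimes have to deal with text that comes out of other code. Text often passes through several different pieces of software, each with its own quirks, possibly with Microsoft Office somewhere in the chain (see this blog post).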

This is where ftfy comes to the rescue.

from ftfy import fix_text
import json
# text = some text source with a potential unicode problem
fixed_text = fix_text(text)
data = json.loads(fixed_text)

Python json.loads unterminated string error
