How to catch a UnicodeDecodeError caused by an invalid continuation byte in MySQL data

Time: 2018-07-15 11:36:11

Tags: mysql python-3.x utf-8 mysql-python unicode-string

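I am moving tens of millions of rows of text data from MySQL to a search engine, and I cannot successfully handle a Unicode error in one of the retrieved strings. I have tried to explicitly encode and decode the retrieved strings so that Python raises the Unicode exception and I can find out where the problem lies.

The exception below is raised after my laptop has worked through tens of millions of rows (sigh...), but I am unable to catch it, skip the offending row, and carry on, which is what I want. All text in the MySQL database should be UTF-8.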

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 143: invalid continuation byte

This is the connection I establish using MySQL Connector/Python:

cnx = mysql.connector.connect(user='root', password='<redacted>',
                          host='127.0.0.1',
                          database='bloggz',
                          charset='utf-8') 

Here are the database character set settings:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR 
Variable_name LIKE 'collation%';

+--------------------------+-----------------+
| Variable_name            | Value           |
+--------------------------+-----------------+
| character_set_client     | utf8            |
| character_set_connection | utf8            |
| character_set_database   | utf8            |
| character_set_filesystem | binary          |
| character_set_results    | utf8            |
| character_set_server     | utf8            |
| character_set_system     | utf8            |
| collation_connection     | utf8_general_ci |
| collation_database       | utf8_general_ci |
| collation_server         | utf8_general_ci |
+--------------------------+-----------------+

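What is wrong with the exception handling below? Note that the variable "last_feeds_id" is not printed either, but that is probably just further proof that the except clauses are not being reached.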

last_feeds_id = 0
for feedsid, ts, url, bid, title, html in cursor:

  try:
    # to catch UnicodeErrors and see where the problem lies
    # from: https://mail.python.org/pipermail/python-list/2012-July/627441.html
    # also see https://stackoverflow.com/questions/28583565/str-object-has-no-attribute-decode-python-3-error

    # feeds.URL is varchar(255) in mysql
    enc_url = url.encode(encoding = 'UTF-8',errors = 'strict')
    dec_url = enc_url.decode(encoding = 'UTF-8',errors = 'strict')

    # texts.title is varchar(600) in mysql
    enc_title = title.encode(encoding = 'UTF-8',errors = 'strict')
    dec_title = enc_title.decode(encoding = 'UTF-8',errors = 'strict')

    # texts.html is text in mysql
    enc_html = html.encode(encoding = 'UTF-8',errors = 'strict')
    dec_html = enc_html.decode(encoding = 'UTF-8',errors = 'strict')

    data = {"timestamp":ts,
            "url":dec_url,
           "bid":bid,
           "title":dec_title,
           "html":dec_html}
    es.index(index="blogposts",
            doc_type="blogpost",
            body=data)
  except UnicodeDecodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

  except UnicodeEncodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

  except UnicodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)
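
One possible explanation, offered as an assumption because the question does not include the traceback: if the connector decodes each row while fetching it, the error is raised by the `for` statement itself, which sits outside the `try` block, so none of the `except` clauses can see it. The sketch below illustrates this with `FakeCursor`, a hypothetical stand-in and not the MySQL Connector/Python cursor, and with `fetch_all_skipping_bad`, a hypothetical helper that calls `next()` inside the `try` so each row can be skipped individually. Whether a real cursor can keep fetching after such an error is not verified here.

```python
class FakeCursor:
    """Hypothetical stand-in for a cursor that decodes each row as it is fetched."""

    def __init__(self, raw_rows):
        self._rows = iter(raw_rows)

    def __iter__(self):
        return self

    def __next__(self):
        raw = next(self._rows)       # StopIteration here ends the iteration
        return raw.decode("utf-8")   # may raise UnicodeDecodeError


def fetch_all_skipping_bad(cursor):
    """Fetch rows one by one so that a decode error is caught per row."""
    good, errors = [], []
    rows = iter(cursor)
    while True:
        try:
            row = next(rows)         # the fetch now happens inside the try
        except StopIteration:
            break
        except UnicodeDecodeError as exc:
            errors.append(exc)       # record the error and move on
            continue
        good.append(row)
    return good, errors


raw_rows = [b"first", b"as\xed es", b"third"]

# Pattern from the question: the try only wraps the loop body.
escaped = False
try:
    for row in FakeCursor(raw_rows):
        try:
            row.encode("utf-8").decode("utf-8")
        except UnicodeError:
            pass                     # not reached for these rows
except UnicodeDecodeError:
    escaped = True                   # the error came from the for statement

good, errors = fetch_all_skipping_bad(FakeCursor(raw_rows))
print(escaped, good, len(errors))
```

The same reasoning would explain why "last_feeds_id" is never printed: the handlers that print it are never entered.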

1 Answer:

Answer 0 (score: 0)

It is complaining about hex ED. Were you expecting an acute-i: í? If so, then the text you have is not encoded as UTF-8, but as one of cp1250, dec8, latin1, latin2, latin5.
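
A minimal sketch of the point about hex ED, using a made-up sample string and not data from the question. In UTF-8, 0xED announces a three-byte sequence, so a following plain ASCII byte triggers "invalid continuation byte", while latin1 and cp1250 both map 0xED to í. The helper `decode_with_fallback` is hypothetical, and falling back to latin1 is only an assumption about what the stray bytes really are.

```python
def decode_with_fallback(data, fallback="latin-1"):
    """Try UTF-8 first; if that fails, decode with the assumed legacy encoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


# Made-up sample: "así es" stored as latin1 bytes, i.e. b'as\xed es'.
raw = "así es".encode("latin-1")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)
    # exc.object holds the bytes being decoded; start/end mark the bad span.
    bad_byte = exc.object[exc.start:exc.end]

print(message)
print(bad_byte)
print(raw.decode("latin-1"), "|", raw.decode("cp1250"))
print(decode_with_fallback(raw))
```

The `start`, `end` and `object` attributes of the exception are also a way to see exactly which bytes in a long string are at fault.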

Does your Python source code start with the following line?

# -*- coding: utf-8 -*-

See more Python-utf8 tips.

Also, review the "best practice" here.

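You have charset='utf-8'; I am not sure, but perhaps it should be charset='utf8' (Reference). UTF-8 is what the rest of the world calls the character set. MySQL calls its 3-byte subset utf8. Note the absence of a dash.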