Question

Hello, I have the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import json

import pandas as pd
import re
import threading
import pickle

import sqlite3
#from treetagger import TreeTagger

conn = sqlite3.connect('Telcel.db')
cursor = conn.cursor()
cursor.execute('select id_comment from Tweets')
id_comment = [i for i in cursor]
cursor.execute('select id_author from Tweets')
id_author = [i for i in cursor]
cursor.execute('select comment_message from Tweets')
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
cursor.execute('select comment_published from Tweets')
comment_published = [i for i in cursor]

This runs fine in Python 2.7.12 and outputs:

~/data$ python DBtoList.py 
8003
8003
8003
8003

But when I run the same code with Python 3, I get:

~/data$ python3 DBtoList.py 
Traceback (most recent call last):
  File "DBtoList.py", line 21, in <module>
    comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
  File "DBtoList.py", line 21, in <listcomp>
    comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
sqlite3.OperationalError: Could not decode to UTF-8 column 'comment_message' with text 'dancing music ������'

I searched for this row and found:

"dancing music 😜"

I'm not sure why the code runs in Python 2. It seems that Python 3.5.2 cannot decode this character at this line:

comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]

I would appreciate any suggestions for solving this problem. Thanks for your support.

Answer 1

Python 3 has no problem with the string itself if you store it through the Python sqlite3 API. I have utf-8 set as my default encoding.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table Tweets (comment_message text)')
conn.execute("insert into Tweets values ('dancing music 😜')")
[(tweet,)] = conn.execute('select comment_message from Tweets')

tweet

Output:

'dancing music 😜'

Now, let's look at the type:

>>> type(tweet)
str

If you use Python str from the start, everything is fine.

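By the way, what you are trying to do (encode as utf-8, then decode as latin-1) makes little sense, especially if the string contains things like emojis. Look at what happens to your tweet: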

>>> tweet.encode('utf-8').decode('latin-1')
'dancing music ð\x9f\x98\x9c'
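
A minimal sketch of why that round trip only produces mojibake, assuming the tweet ends with the character U+1F61C (which is what the bytes \x9f\x98\x9c in the output above suggest). Each of the emoji's four UTF-8 bytes is reinterpreted as a separate latin-1 character, and the damage can be undone only by reversing the two steps:

```python
# Assumption: the tweet ends with U+1F61C, matching the bytes shown above.
tweet = 'dancing music \U0001F61C'

# Encode to UTF-8 bytes, then misread those bytes as latin-1.
mojibake = tweet.encode('utf-8').decode('latin-1')

# The single emoji has become four unrelated latin-1 characters,
# so the garbled string is three characters longer than the original.
print(len(tweet), len(mojibake))

# Reversing the two steps recovers the original text.
restored = mojibake.encode('latin-1').decode('utf-8')
print(restored == tweet)
```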

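Now to your actual problem: the strings (byte sequences) in your database were stored with an encoding other than utf-8. The error you see is raised because the sqlite3 library tries to decode these byte sequences and fails, since the bytes are not valid utf-8 sequences. The only way to solve this is to:

1. Find out which encoding was used to encode the strings in the database.
2. Decode the strings with that encoding by setting conn.text_factory = lambda x: str(x, 'latin-1'). This assumes that the strings were stored as latin-1.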
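
The two steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the legacy row is simulated with `cast(? as text)`, which stores the given bytes as TEXT without validating them, and latin-1 is assumed to be the real encoding. Setting `text_factory = bytes` is one way to look at the stored bytes before committing to a guess.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table Tweets (comment_message text)')

# Simulate a legacy row: latin-1 bytes stored as TEXT without validation.
legacy_bytes = b'caf\xe9'  # 'café' encoded as latin-1, not valid UTF-8
conn.execute('insert into Tweets values (cast(? as text))', (legacy_bytes,))

# With the default text_factory (str), reading the row fails.
try:
    conn.execute('select comment_message from Tweets').fetchall()
    default_read_failed = False
except sqlite3.OperationalError:
    default_read_failed = True

# Step 1: look at the raw bytes to work out the real encoding.
conn.text_factory = bytes
[(raw,)] = conn.execute('select comment_message from Tweets')
print(raw)

# Step 2: decode every TEXT value with that encoding (latin-1 assumed here).
conn.text_factory = lambda data: str(data, 'latin-1')
[(tweet,)] = conn.execute('select comment_message from Tweets')
# tweet is now an ordinary str holding 'café'
```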

I would then suggest that you go through the database and update the values so that they are encoded as utf-8, which is the default behavior.
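
One possible way to do that clean-up is sketched below. `reencode_column` is a hypothetical helper, not part of sqlite3. It assumes an ordinary rowid table, that the table and column names are trusted (they are interpolated into the SQL), and that any value which is not already valid utf-8 was stored in `source_encoding`. That last point is a heuristic, so it is safer to try it on a copy of the database first.

```python
import sqlite3


def reencode_column(conn, table, column, source_encoding):
    """Rewrite every text value of one column as UTF-8 (hypothetical helper)."""
    # Read raw bytes so that sqlite3 does not attempt to decode them.
    conn.text_factory = bytes
    query = 'select rowid, {} from {}'.format(column, table)
    rows = conn.execute(query).fetchall()
    conn.text_factory = str  # restore the default

    update = 'update {} set {} = ? where rowid = ?'.format(table, column)
    for rowid, raw in rows:
        if not isinstance(raw, bytes):
            continue  # skip NULLs and non-text values
        try:
            text = raw.decode('utf-8')  # already valid UTF-8: keep as is
        except UnicodeDecodeError:
            text = raw.decode(source_encoding)  # assumed legacy encoding
        # str parameters are written back to the database as UTF-8.
        conn.execute(update, (text, rowid))
    conn.commit()
```

For the table in the question, the call would look like reencode_column(conn, 'Tweets', 'comment_message', 'latin-1'), after which the column can be read with the default text_factory.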

See also this question.

I also strongly recommend reading this article on how encodings work.
