Question

I have a bunch of English sentences that I extracted from a text file into a MySQL table. This is how I created the table in MySQL:

create table sentences ( ID int NOT NULL AUTO_INCREMENT ,  sentence varchar (255) , primary key (ID) ) character set = utf8;

This is my Python script:

from bs4 import BeautifulSoup as b
import sys
from fixsentence import *
import MySQLdb as db

bound = sys.argv[1]

con = db.connect('localhost' , 'root' , 'ayrefik1' , 'knowledgebase2')
curs = con.cursor()

def gettext(file):
        temp_file = open(file)
        soup = b(temp_file)
        list = get_sentences(soup.get_text())

        for x in list:
                curs.execute('SET NAMES utf8;')
                curs.execute('insert ignore into sentences (sentence)  values (%s);', (x))
                con.commit()


gettext(bound)

I run the script on a file like this:

python wikitext.py test

So even though I specified that the table should be able to handle all UTF-8 characters, I still get this error:

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 86-87: ordinal not in range(256)

Answer 1

I guess you are using Python 2.x. Consider what happens when you execute this line:

curs.execute('insert ignore into sentences (sentence)  values (%s);', (x))

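If x is a unicode object, it is encoded to a byte string using a default character set, not the character set you declared on the table. Assuming that default is latin-1 (which the error message suggests) and x contains characters that latin-1 cannot represent, the encoding fails and Python raises the error. You have to convert x to a byte string yourself with an explicit character set. Try: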

curs.execute('insert ignore into sentences (sentence)  values (%s);', (x.encode('utf-8'),))
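
To see the encoding failure without a database, here is a minimal sketch. The sample string `sentence` and the helper `try_encode` are made up for illustration and are not part of the original script. It shows that latin-1 cannot represent the text while UTF-8 can, and that the query parameters should be a one-element tuple.

```python
# Illustration only: no database is involved.
# `sentence` is a hypothetical sample value; \u00e9 fits in latin-1,
# but the CJK characters that follow it do not.
sentence = u'caf\u00e9 \u65e5\u672c\u8a9e'


def try_encode(text, charset):
    """Return text encoded with charset, or None if charset cannot represent it."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


# latin-1 only covers code points 0-255, so this encoding fails.
latin1_result = try_encode(sentence, 'latin-1')

# UTF-8 can represent every Unicode code point, so this succeeds.
utf8_result = try_encode(sentence, 'utf-8')

# Query parameters should be a sequence: (value) is just value,
# while (value,) is a one-element tuple.
params = (sentence.encode('utf-8'),)
```

A commonly used alternative, which the answer above does not mention, is to pass `charset='utf8'` and `use_unicode=True` as keyword arguments to `MySQLdb.connect`, so that the driver encodes unicode parameters as UTF-8 itself. Whether that is enough for this script is an assumption that would need testing against the actual data.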
