Python, XML and MySQL - ascii vs utf8 encoding issue

Asked: 2016-03-22 13:27:21

Tags: python mysql xml encoding utf-8

I have a MySQL table with XML content stored in a longtext field that uses the utf8mb4_general_ci collation.

[Image: database table]

I want to use a Python script to read the XML data from the transcript field, modify an element, and then write the value back to the database.

When I try to parse the XML content into an element using ElementTree.fromstring, I get the following encoding error:

Traceback (most recent call last):
  File "ImageProcessing.py", line 33, in <module>
    root = etree.fromstring(row[1])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1640, in feed
    self._parser.Parse(data, 0)

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 9568: ordinal not in range(128)

Code:

import datetime
import mysql.connector
import xml.etree.ElementTree as etree

# Creates the config parameters, connects
# to the database and creates a cursor 
config = {
  'user': 'username',
  'password': 'password',
  'host': '127.0.0.1',
  'database': 'dbname',
  'raise_on_warnings': True,
  'use_unicode': True,
  'charset': 'utf8',
}
cnx = mysql.connector.connect(**config)
cursor = cnx.cursor()

# Structures the SQL query
query = ("SELECT * FROM transcription")

# Executes the query and fetches the first row
cursor.execute(query)
row = cursor.fetchone()

while row is not None:
    print(row[0])

    #Some of the things I have tried to resolve the encoding issue
    #parser = etree.XMLParser(encoding="utf-8")
    #root = etree.fromstring(row[1], parser=parser)
    #row[1].encode('ascii', 'ignore')

    #Line where the encoding error is being thrown
    root = etree.fromstring(row[1])

    for img in root.iter('img'):
        refno = img.text
        img.attrib['href']='http://www.link.com/images.jsp?doc=' + refno
        print img.tag, img.attrib, img.text

    row = cursor.fetchone()

cursor.close()
cnx.close()

1 Answer:

Answer 0 (score: 0)

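You have set everything up correctly and your database connection is returning Unicode strings, which is a good thing.

Unfortunately, ElementTree's fromstring() in Python 2 expects a byte str, not a Unicode string. It wants bytes so that it can decode them itself using the encoding declared in the XML header. When it is handed a Unicode string, Python 2 implicitly encodes it with the ascii codec first, and that fails on characters such as u'\u2014'.

You need to use this instead: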

utf_8_xml = row[1].encode("utf-8")
root = etree.fromstring(utf_8_xml)
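
For the full round trip described in the question (read, modify, write back), a minimal sketch might look like the following. The helper name add_image_links and the sample XML are made up for illustration. It also assumes the stored XML either has no encoding declaration or declares UTF-8, since the bytes handed to the parser are UTF-8.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as etree


def add_image_links(xml_text, base_url):
    """Hypothetical helper: parse Unicode XML text, set an href on every
    <img> element, and return the modified XML as Unicode text."""
    # Encode to UTF-8 bytes first so the parser never has to guess a codec.
    root = etree.fromstring(xml_text.encode("utf-8"))

    for img in root.iter("img"):
        refno = img.text or ""  # guard against empty <img/> elements
        img.set("href", base_url + refno)

    # tostring(..., encoding="utf-8") returns bytes; decode them so the
    # result can be passed to a connection opened with use_unicode=True.
    return etree.tostring(root, encoding="utf-8").decode("utf-8")


# Made-up sample containing the em dash (U+2014) from the traceback.
sample = u"<doc><p>before \u2014 after</p><img>A123</img></doc>"
updated = add_image_links(sample, "http://www.link.com/images.jsp?doc=")
```

The returned text could then be written back with a parameterized UPDATE through the same connector. The key column to match on depends on the table schema, which the question does not show.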