Question

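I'm trying to parse an RSS feed with feedparser and insert it into a mySQL table using SQLAlchemy. I was actually able to get this running just fine, but today the feed had an item with an ellipsis character in its description, and I get the following error: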

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)

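If I add the convert_unicode=True option to the engine, I can get the insert to go through, but the ellipsis doesn't show up; it's just weird characters. That seems to make sense, since to the best of my knowledge there is no horizontal ellipsis in latin-1. Even if I set the encoding to utf-8, it doesn't seem to make a difference. If I do the insert using phpmyadmin and include the ellipsis, it goes through fine.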

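I'm thinking I just don't understand character encodings, or how to get SQLAlchemy to use the one I specify. Does anyone know how to get the text to go in without the weird characters?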

Update

I think I've figured this one out, but I'm not really sure why it matters...

Here is the code:

import sys
import feedparser
import sqlalchemy
from sqlalchemy import create_engine, MetaData, Table

COMMON_CHANNEL_PROPERTIES = [
  ('Channel title:','title', None),
  ('Channel description:', 'description', 100),
  ('Channel URL:', 'link', None),
]

COMMON_ITEM_PROPERTIES = [
  ('Item title:', 'title', None),
  ('Item description:', 'description', 100),
  ('Item URL:', 'link', None),
]

INDENT = u' '*4

def feedinfo(url, output=sys.stdout):
  feed_data = feedparser.parse(url)
  channel, items = feed_data.feed, feed_data.entries

  #adding charset=utf8 here is what fixed the problem

  db = create_engine('mysql://user:pass@localhost/db?charset=utf8')
  metadata = MetaData(db)
  rssItems = Table('rss_items', metadata,autoload=True)
  i = rssItems.insert()

  for label, prop, trunc in COMMON_CHANNEL_PROPERTIES:
    value = channel[prop]
    if trunc:
      value = value[:trunc] + u'...'
    print >> output, label, value
  print >> output
  print >> output, "Feed items:"
  for item in items:
    i.execute({'title':item['title'], 'description': item['description'][:100]})
    for label, prop, trunc in COMMON_ITEM_PROPERTIES:
      value = item[prop]
      if trunc:
        value = value[:trunc] + u'...'
      print >> output, INDENT, label, value
    print >> output, INDENT, u'---'
  return

if __name__=="__main__":
  url = sys.argv[1]
  feedinfo(url)

Here is the output/traceback from running the code without the charset option:

Channel title: [H]ardOCP News/Article Feed
Channel description: News/Article Feed for [H]ardOCP...
Channel URL: http://www.hardocp.com

Feed items:
     Item title: Windows 8 UI is Dropping the 'Start' Button
     Item description: After 15 years of occupying a place of honor on the desktop, the "Start" button will disappear from ...
     Item URL: http://www.hardocp.com/news/2012/02/05/windows_8_ui_dropping_lsquostartrsquo_button/
     ---
     Item title: Which Crashes More&#63; Apple Apps or Android Apps
     Item description: A new study of smartphone apps between Android and Apple conducted over a two month period came up w...
     Item URL: http://www.hardocp.com/news/2012/02/05/which_crashes_more63_apple_apps_or_android/
     ---
Traceback (most recent call last):
  File "parse.py", line 47, in <module>
    feedinfo(url)
  File "parse.py", line 36, in feedinfo
    i.execute({'title':item['title'], 'description': item['description'][:100]})
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/expression.py", line 2758, in execute
    return e._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2304, in _execute_clauseelement
    return connection._execute_clauseelement(elem, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1538, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_context
    context)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 330, in do_execute
    cursor.execute(statement, parameters)
  File "build/bdist.linux-i686/egg/MySQLdb/cursors.py", line 159, in execute
  File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 264, in literal
  File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 202, in unicode_literal
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)

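So it looks like adding the charset to the mysql connection string did it. I suppose it defaults to latin-1? I had tried setting the encoding flag on content_engine to utf8 and that did nothing. Does anyone know why it would use latin-1 when the tables and fields are set to utf8 unicode? I also tried encoding item['description'] with .encode('cp1252') before sending it off, and that worked fine even without adding the charset option to the connection string. That shouldn't have worked with latin-1, but apparently it did? I've got the solution, but I would love an answer :)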

Answer 1

The error message

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' 
in position 35: ordinal not in range(256)

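seems to indicate that some Python code is trying to convert the character \u2026 into a Latin-1 (ISO8859-1) string, and it is failing. That is not surprising: the character is U+2026 HORIZONTAL ELLIPSIS, which has no single equivalent character in ISO8859-1.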
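The failure can be reproduced without any database at all. The following is a minimal sketch in Python 3 syntax (the thread itself uses Python 2.7, where the literal would be written u'...'); the helper name can_encode and the sample text are made up for illustration.

```python
# Sample text containing U+2026 HORIZONTAL ELLIPSIS (illustrative only).
text = "the Start button will disappear from \u2026"


def can_encode(value, codec):
    """Return True if every character in value can be encoded with codec."""
    try:
        value.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Latin-1 only covers code points 0..255, and U+2026 lies well above that.
latin1_ok = can_encode(text, "latin-1")
# UTF-8 can represent every Unicode code point.
utf8_ok = can_encode(text, "utf-8")

print(latin1_ok, utf8_ok)
```

The "ordinal not in range(256)" part of the message refers to exactly this limit: the code point of the character is larger than 255.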

You fixed the problem by adding the query ?charset=utf8 to your SQLAlchemy connection call:

import sqlalchemy
from sqlalchemy import create_engine, MetaData, Table

db = create_engine('mysql://user:pass@localhost/db?charset=utf8')

The Database Urls section of the SQLAlchemy documentation tells us that a URL beginning with mysql indicates the MySQL dialect, using the mysql-python driver.

The following section, Custom DBAPI connect() arguments, tells us that query arguments are passed on to the underlying DBAPI.
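To make the idea concrete, here is a rough sketch of how the query part of such a URL turns into keyword arguments. It uses only the standard library's urllib.parse and is not SQLAlchemy's own URL parser; the variable name connect_kwargs is an assumption for illustration.

```python
from urllib.parse import parse_qsl, urlsplit

url = "mysql://user:pass@localhost/db?charset=utf8"

parts = urlsplit(url)

# Everything after '?' becomes a dict of string options. Conceptually,
# this is what ends up being handed to the driver's connect() call.
connect_kwargs = dict(parse_qsl(parts.query))

print(parts.scheme)             # dialect name
print(parts.hostname)           # database host
print(parts.path.lstrip("/"))   # database name
print(connect_kwargs)           # extra driver arguments
```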

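So, what does the mysql-python driver make of the parameter {charset: 'utf8'}? The Functions and attributes section of its documentation says of the charset attribute: "...If present, the connection character set will be changed to this character set, if they are not equal."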

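To find out what the connection character set means, we turn to 10.1.4. Connection Character Sets and Collations in the MySQL 5.6 reference manual. To make a long story short, MySQL can interpret incoming queries in an encoding that differs from the database's character set, and that also differs from the encoding of the query results it returns.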

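Since the error message you reported looks like a Python error message rather than a SQL one, I'll speculate that something in SQLAlchemy or mysql-python is trying to convert the query to a default connection encoding of latin-1 before sending it. That is what triggers the error. The query string ?charset=utf8 in your connect() call, however, changes the connection encoding, and the U+2026 HORIZONTAL ELLIPSIS is able to get through.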

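Update: you also ask, "if I remove the charset option and then encode the description using .encode('cp1252'), it goes through just fine. How is an ellipsis able to get through with cp1252 but not unicode?"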

The encoding cp1252 has a horizontal ellipsis character at byte value \x85. It is therefore possible to encode a Unicode string containing U+2026 HORIZONTAL ELLIPSIS into cp1252 without error.
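This is easy to check directly. The sketch below is in Python 3 syntax and the variable names are illustrative. It also shows that the same byte means something different under ISO8859-1, where 0x85 is a control character rather than an ellipsis.

```python
ellipsis = "\u2026"

# cp1252 assigns the horizontal ellipsis to the single byte 0x85.
cp1252_bytes = ellipsis.encode("cp1252")
print(repr(cp1252_bytes))

# Decoding that byte as cp1252 round-trips back to U+2026 ...
round_trip = cp1252_bytes.decode("cp1252")

# ... but decoding it as ISO8859-1 yields U+0085, a C1 control character.
latin1_view = cp1252_bytes.decode("latin-1")
print(repr(round_trip), repr(latin1_view))
```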

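Remember also that in Python, Unicode strings and byte strings are two different data types. It's reasonable to speculate that MySQLdb has a policy of sending only byte strings down a SQL connection. It would therefore encode a query it receives as a Unicode string into a byte string, but leave a query it receives as a byte string alone. (This is speculation; I haven't looked at the source code.)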

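In the traceback you posted, the last two lines (closest to where the error occurred) show the method name literal, followed by unicode_literal. That tends to support the theory that MySQLdb is encoding the query it receives as a Unicode string into a byte string.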
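The speculated policy could be sketched as follows. This is a hypothetical illustration of the theory, not MySQLdb's actual code; the function name to_wire and its connection_charset parameter are invented for this example, and Python 3's str and bytes stand in for Python 2's unicode and str.

```python
def to_wire(value, connection_charset="latin-1"):
    """Hypothetical policy: only byte strings go over the connection.

    Byte strings are passed through untouched; text strings are encoded
    with the connection character set first.
    """
    if isinstance(value, bytes):
        return value
    return value.encode(connection_charset)


# Already-encoded bytes are not inspected, so they never raise.
print(to_wire("\u2026".encode("cp1252")))

# Text is encoded with the connection charset; utf-8 can hold the ellipsis.
print(to_wire("\u2026", connection_charset="utf-8"))

# With the latin-1 default, the same text fails just like in the traceback.
try:
    to_wire("\u2026")
except UnicodeEncodeError as exc:
    print("failed:", exc.reason)
```

Under that assumption, a description pre-encoded with cp1252 would skip the encoding step entirely, which would explain why it did not raise.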

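When you encode the query string yourself, you bypass the part of MySQLdb that would otherwise do this encoding, possibly in a different way. Note, however, that if you encode the query string differently from what the MySQL connection charset calls for, you will have an encoding mismatch, and your text will likely be stored wrong.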
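A mismatch of this kind is what produces the familiar "weird characters". The sketch below simulates it in plain Python 3 with no database involved: the text is encoded one way by the sender and interpreted another way by the receiver. The variable names and sample text are illustrative.

```python
original = "came up w\u2026"

# The sender encodes the text as UTF-8: the ellipsis becomes three bytes.
sent = original.encode("utf-8")

# The receiver wrongly assumes the bytes are cp1252 and decodes each of
# the three bytes as a separate character.
misread = sent.decode("cp1252")

print(repr(sent))
print(misread)

# When both sides agree on the encoding, the text survives intact.
agreed = sent.decode("utf-8")
```

No error is raised in the mismatched case, which is why this kind of problem tends to show up only later, when the stored text is read back.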

Answer 2

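Adding charset=utf8 to the connection string definitely helps, but I've run into situations in Python 2.7 where I also had to add convert_unicode=True to create_engine. The SQLAlchemy docs say it's only there to improve performance, but in my case it actually solved the problem of the wrong encoder being used.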

How do I get SQLAlchemy to correctly insert a unicode ellipsis into a mySQL table?
