Encoding for reading Italian text in Python?

Asked: 2013-04-04 23:35:38

Tags: python visual-studio-2010 encoding

I am using Python Tools for Visual Studio and reading some files written in Italian. I have tried iso-8859-1, iso-8859-2, utf-8, and utf-8-sig. Notepad++ opens the files as UTF-8 without BOM.

content = fp.read()
words = content.decode("utf-8-sig").lower().split()
for w in words:
    p=''
    cur.execute('SELECT word FROM  multiwordnet.italian_lemma l, multiwordnet.italian_synset s where l.id = s.id and l.lemma="%s"' % w) 

The string that causes the crash is C'è (it is read as "c\'\xe3\xa8").

Using chardet did not help.

Traceback (most recent call last):
File "C:\Users\Tathagata\Documents\Visual Studio 2012\Projects\PythonApplicati
on4\PythonApplication4\PythonApplication4.py", line 344, in <module>
createSynsetDict()
File "C:\Users\Tathagata\Documents\Visual Studio 2012\Projects\PythonApplicati
on4\PythonApplication4\PythonApplication4.py", line 294, in createSynsetDict
cur.execute('SELECT word FROM  multiwordnet.italian_lemma l, multiwordnet.it
alian_synset s where l.id = s.id and l.lemma="%s"' % w)
File "C:\Python27\lib\site-packages\pymysql\cursors.py", line 117, in execute
self.errorhandler(self, exc, value)
File "C:\Python27\lib\site-packages\pymysql\connections.py", line 187, in defa
ulterrorhandler
raise Error(errorclass, errorvalue)
Error: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u's\
x00\x00\x00\x03SELECT word FROM  multiwordnet.italian_lemma l, multiwordnet.ital 
ian_synset s where l.id = s.id and l.lemma="c\'\xe3\xa8"', 116, 118, 'ordinal no
t in range(128)'))

1 Answer:

Answer 0 (score: 1)

Assuming your database binding's paramstyle is format ...

content = fp.read()
words = content.decode("utf-8-sig").lower().split()
for w in words:
    p=''
    cur.execute('SELECT word FROM ' +
                'multiwordnet.italian_lemma l, ' +
                'multiwordnet.italian_synset s ' +
                'where l.id = s.id and l.lemma=%s', (w,))

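Note that we do not use the % operator between the SQL string and the variable being passed in, and we do not put quotes around the %s. Instead, %s is a placeholder that marks where the database binding should substitute a value, and the value to substitute is passed as a separate argument. Following this practice not only spares you from dealing with encoding issues (if your parameters are passed as Python Unicode strings, the database binding takes it from there), it also prevents SQL injection vulnerabilities.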
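The same idea can be tried without a MySQL server. The following is only a minimal sketch, assuming the standard-library sqlite3 module as a stand-in for pymysql; sqlite3 uses the qmark style, so the placeholder is ? rather than %s. The single-table layout, the rows, and the find_lemmas helper are made up for illustration and are not the real multiwordnet schema.

```python
# -*- coding: utf-8 -*-
import sqlite3


def find_lemmas(conn, word):
    """Look up a word with a bound parameter instead of string formatting."""
    cur = conn.cursor()
    # The value travels separately from the SQL text, so the apostrophe and
    # the accented letter need no manual quoting or encoding.
    cur.execute("SELECT lemma FROM italian_lemma WHERE lemma = ?", (word,))
    return [row[0] for row in cur.fetchall()]


# Hypothetical in-memory table, only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE italian_lemma (id INTEGER PRIMARY KEY, lemma TEXT)")
conn.executemany(
    "INSERT INTO italian_lemma (lemma) VALUES (?)",
    [(u"c'\u00e8",), (u"casa",)],
)

# u"C'\u00e8" is the word C'è from the question, written with an escape.
matches = find_lemmas(conn, u"C'\u00e8".lower())
```

Because the word is never spliced into the SQL string, a value containing quotes is treated as plain data and cannot alter the query.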

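Other database libraries for Python may use a different placeholder style; read the documentation or check the module-level paramstyle constant. (For qmark, your placeholder should be ?; for numeric, it should be a colon-prefixed number, :1 for the first parameter, :2 for the second, and so on.)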
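As a rough illustration of checking paramstyle, the sketch below maps the positional styles mentioned above to their placeholder text. The placeholder and lemma_query helpers are hypothetical names, sqlite3 is used only because it ships with Python, and the named and pyformat styles are left out because they use named rather than positional placeholders.

```python
import sqlite3


def placeholder(paramstyle, position=1):
    """Return the placeholder for the position-th parameter (1-based)."""
    if paramstyle == "qmark":
        return "?"
    if paramstyle == "numeric":
        return ":%d" % position
    if paramstyle == "format":
        return "%s"
    # Styles such as "named" and "pyformat" are not handled in this sketch.
    raise ValueError("not a positional paramstyle: %r" % (paramstyle,))


def lemma_query(dbapi_module):
    """Build the lookup query with the placeholder the module expects."""
    return ("SELECT lemma FROM italian_lemma WHERE lemma = "
            + placeholder(dbapi_module.paramstyle))


style = sqlite3.paramstyle   # module-level constant required by the DB-API
sql = lemma_query(sqlite3)
```

Only the placeholder text is built by string concatenation here; the value itself is still passed to execute as a separate argument.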