Question

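I wrote a program that reads words from a text file and inserts them into a sqlite database, treating them as strings. But I also need to insert some words containing German umlauts: äöüß.

Here is a prepared piece of code:

I tried both # -*- coding: iso-8859-15 -*- and # -*- coding: utf-8 -*-, and it made no difference (!)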

    # -*- coding: iso-8859-15 -*-
    import sqlite3

    dbname = 'sampledb.db'
    filename ='text.txt'


    con = sqlite3.connect(dbname)
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,name)''')    

    #f=open(filename)
    #text = f.readlines()
    #f.close()

    text = u'süß'

    print (text)
    cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,))       

    con.commit()

    sentence = "The name is: %s" %(text,)

    print (sentence)
    con.close()

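The code above runs fine. But I need to read "text" from a file that contains the word "süß". So when I uncomment the three lines (f=open(filename) ...) and comment out text = u'süß', I get this error: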

    sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.

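I tried using the codecs module to read the file as utf-8 and as iso-8859-15. But I could not decode the result into the string 'süß', which I need at the end of the code to complete my sentence.

I also tried decoding to utf-8 before inserting into the database. That worked, but I could not use the result as a string.

Is there a way to import süß from a file and use it both for inserting into sqlite and as a string?

More details:

I am adding more details here to clarify. I had used codecs.open before. The text file containing the word süß is saved as utf-8. With f=codecs.open(filename, 'r', 'utf-8') and text=f.read(), I read the file as the unicode string u'\ufeffs\xfc\xdf'. Inserting this unicode string into sqlite3 goes smoothly: cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,)).

The problem is this: sentence = "The name is: %s" %(text,) gives u'The name is: \ufeffs\xfc\xdf', and I also need print(text) to output süß, but print(text) raises the error UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>.

Thanks.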

Answer 1

When you open and read a file, you get 8-bit strings rather than Unicode. To get Unicode strings, open the file with codecs.open:

f=codecs.open(filename, 'r', 'utf-8')

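Of course, depending on how the file was written, you may need to use 'iso-8859-15' instead.

Edit: One big difference between your test code and the commented-out code is that reading from the file with readlines() produces a list, while the test uses a single string. Maybe your problem has nothing to do with Unicode at all. Try this substitution in your test code and see whether it produces the same error: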

text = [u'süß']

Unfortunately, I don't have enough experience using SQL from Python to help you further.
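To make the list-versus-string point concrete, here is a minimal sketch written for Python 3 (the thread itself uses Python 2). It assumes an in-memory database and a hard-coded list standing in for what f.readlines() returns; both are illustrative choices, not taken from the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cur = con.cursor()
cur.execute("create table table1 (id INTEGER PRIMARY KEY, name)")

# Stand-in for what f.readlines() returns: a list of lines, not one string.
lines = [u"süß\n", u"grün\n"]

# Binding the whole list as a single parameter fails, because sqlite3
# has no way to store a Python list in a column.
try:
    cur.execute("insert into table1 (id, name) VALUES (NULL, ?)", (lines,))
    list_binding_failed = False
except sqlite3.Error:
    list_binding_failed = True

# Binding one stripped string per row works.
cur.executemany(
    "insert into table1 (id, name) VALUES (NULL, ?)",
    [(line.strip(),) for line in lines],
)
con.commit()

names = [row[0] for row in cur.execute("select name from table1 order by id")]
con.close()
```

The exact exception class for the failed binding differs between Python versions, which is why the sketch catches the common base class sqlite3.Error.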

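Also, when you print a list rather than a single string, the Unicode characters are replaced by their equivalent escape sequences. To see what the strings really look like, print them one at a time. In case you're curious, that is the difference between __str__ and __repr__.

Edit 2: The character u'\ufeff' is called the Byte Order Mark, or BOM, and some editors insert it to indicate that the file really is UTF-8. You should get rid of it before using the string. There should be only one, at the very beginning of the file. See, e.g., Reading Unicode file data with BOM chars in Python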
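A small sketch of that __str__ versus __repr__ difference, assuming a list whose first item carries a BOM as in the question. Note that Python 2 escapes every non-ASCII character in the repr, while Python 3 escapes only non-printable ones such as the BOM, so the exact list output depends on the version.

```python
# First item carries a BOM, as in the question; the list is illustrative.
words = [u"\ufeffsüß", u"sweet"]

# Converting the list to text calls repr() on each item, so the BOM
# appears as the six-character escape sequence \ufeff.
list_text = str(words)

# Formatting a single item uses its string form, which keeps the
# real characters (including the invisible BOM itself).
item_text = u"%s" % words[0]
```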
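One way to drop the BOM at read time is the standard 'utf-8-sig' codec, which skips a leading BOM if one is present. The sketch below is an assumption-laden illustration: it writes its own temporary file (the path and contents are made up) and uses io.open, which behaves the same in Python 2.7 and Python 3; codecs.open accepts the same codec name.

```python
import codecs
import io
import os
import tempfile

# Create a small UTF-8 file that starts with a BOM, like the one in the question.
path = os.path.join(tempfile.mkdtemp(), "text.txt")  # hypothetical file
with io.open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"süß".encode("utf-8"))

# Plain 'utf-8' keeps the BOM as the first character of the decoded text.
with io.open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# 'utf-8-sig' skips a leading BOM if there is one.
with io.open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()
```

Files without a BOM decode the same way under both codecs, so 'utf-8-sig' is a reasonable default when the origin of the file is unknown.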

Answer 2

I was able to solve the problem. Thank you for your help.

Here it is:

# -*- coding: iso-8859-1 -*-

import sys 
import codecs
import sqlite3

f = codecs.open("suess_sweet.txt", "r", "utf-8")    # suess_sweet.txt file contains two
text_in_unicode = f.read()                          # comma-separated words: süß, sweet 
f.close()

stdout_encoding = sys.stdout.encoding or sys.getfilesystemencoding()

con = sqlite3.connect('dict1.db')
cur = con.cursor()
cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,German,English)''')    

[ger,eng] = text_in_unicode.split(',')

cur.execute('''insert into table1 (id,German,English) VALUES (NULL,?,?)''',(ger,eng))       

con.commit()

sentence = "The German word is: %s" %(ger,)

print sentence.encode(stdout_encoding)

con.close()

I got some help from this page (it is in German)

and the output is:

The German word is: ?süß

There is still one small problem, the '?'. I thought the unicode prefix u' was being replaced by ? after encoding. sentence gives:

>>> sentence
u'The German word is: \ufeffs\xfc\xdf '

and the encoded sentence gives:

>>> sentence.encode(stdout_encoding)
'The German word is: ?s\xfc\xdf '

So it is not what I thought.

A simple solution I came up with to get rid of the question mark is to use the replace function:

sentence = "The German word is: %s" %(ger,)
to_print = sentence.encode(stdout_encoding)
to_print = to_print.replace('?','')

>>> print(to_print)
The German word is: süß
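Replacing every '?' would also remove question marks that belong to the text. A hedged alternative is to strip the BOM from the unicode string before encoding. The sketch below assumes the console encoding is cp1252, which is only a guess based on the 'charmap' error in the question; the hard-coded value of ger stands in for what was read from the file.

```python
# Value as read from the file in the answer above, BOM included.
ger = u"\ufeffsüß"

# Strip a leading BOM from the unicode text itself, before any encoding.
clean = ger.lstrip(u"\ufeff")
sentence = u"The German word is: %s" % (clean,)

# errors="replace" substitutes "?" only for characters the target
# encoding cannot represent; cp1252 (assumed here) can encode ü and ß.
to_print = sentence.encode("cp1252", errors="replace")

# A question mark that is part of the text survives untouched.
question = (u"Ist das %s?" % (clean,)).encode("cp1252", errors="replace")
```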

Thank you :)

python: open and read a file containing German umlauts as unicode
