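There are many questions about Python and unicode/str, but none of the answers has worked for me.
First, I open the file with DictReader and put each row into a list. I then pass the dict values on to be converted to unicode.
The first step is to get the data: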
f = csv.DictReader(open(filename, "r"))
data = []
for row in f:
    data.append(row)
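The second step is to get a string value from the dict and replace the accented characters (strip_accents was found in another post):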
s = data[i].get('Name')
strip_accents(s)

def strip_accents(s):
    try: s = unicode(s)
    except: s = s.encode('utf-8')
    s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
    return s
I use try/except because some strings have accents and others do not. What I cannot figure out is why unicode(s) works on a str without accents but fails when the str contains an accent:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 11: ordinal not in range(128)
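The failure above can be reproduced in isolation. The following is a minimal sketch, written for Python 3, where byte strings and text are separate types (an assumption; the question uses Python 2.7, where unicode(s) performs the same implicit ASCII decode). The sample value `raw` and the helper `try_decode` are hypothetical names used only for illustration.

```python
# Hypothetical sample: "Müller" encoded as cp1252/ISO-8859-1,
# so the ü is the single byte 0xfc.
raw = b"M\xfcller"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


ascii_result = try_decode(raw, "ascii")        # fails: 0xfc is outside range(128)
plain_result = try_decode(b"Muller", "ascii")  # succeeds: pure ASCII bytes
cp1252_result = try_decode(raw, "cp1252")      # succeeds: 0xfc maps to ü
```

This matches the behavior described in the question: unaccented values decode as ASCII without complaint, while any byte above 0x7f needs an explicit encoding.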
I have already seen this post, but the answer does not work. When I check the value with type(), it reports <type 'str'>. So I tried reading the file as unicode:
f = csv.DictReader(codecs.open(filename,"r",encoding='utf-8'))
But as soon as the rows are read:
data = []
for row in f:
    data.append(row)
this error occurs:
  File "F:...files.py", line 9, in files
    for row in f:
  File "C:\Python27\lib\csv.py", line 104, in next
    row = self.reader.next()
  File "C:\Python27\lib\codecs.py", line 684, in next
    return self.reader.next()
  File "C:\Python27\lib\codecs.py", line 615, in next
    line = self.readline()
  File "C:\Python27\lib\codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python27\lib\codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 0: invalid start byte
Is this error caused by the way DictReader handles unicode? How can I fix it?
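The "invalid start byte" message indicates that the file is not UTF-8 at all: 0xfc cannot begin a UTF-8 sequence, whereas in cp1252 and ISO-8859-1 it is the letter ü. One way to probe this is to try candidate encodings in order. The sketch below assumes Python 3; the helper name `decode_with_fallback` and the candidate list are hypothetical, and cp1252 as the fallback is only a guess about the data.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Decode bytes with the first candidate encoding that accepts them.

    Returns a (text, encoding_used) tuple. The candidate order matters:
    UTF-8 is strict, so it is tried first; cp1252 accepts almost any byte.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# 0xfc is not a valid UTF-8 start byte, so the helper falls through to cp1252.
legacy_text, legacy_encoding = decode_with_fallback(b"\xfcber")

# Genuine UTF-8 input is accepted by the first candidate.
utf8_text, utf8_encoding = decode_with_fallback(u"\u00fcber".encode("utf-8"))
```

Trying encodings in sequence is a heuristic, not detection: a file that decodes without error under cp1252 is not guaranteed to have been written as cp1252.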
More testing. As @univerio pointed out, one of the entries causing the failure is encoded as ISO-8859-1.
Changing the open statement to:
f = csv.DictReader(codecs.open(filename,"r",encoding="cp1252"))
produces a slightly different error:
  File "F:...files.py", line 9, in files
    for row in f:
  File "C:\Python27\lib\csv.py", line 104, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)
Going back to the basic open statement and modifying strip_accents(), for example:
    try: s = unicode(s)
    except: s = s.decode("iso-8859-1").encode('utf8')
    print type(s)
    s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
    return str(s)
prints that the type is still str, and raises an error on the line s = unicodedata.normalize('NFKD', s).encode('ascii','ignore'):
TypeError: must be unicode, not str
Based on Python: Converting from ISO-8859-1/latin1 to UTF-8, changing the line to
s = unicode(s.decode("iso-8859-1").encode('utf8'))
produces a different error:
except: s = unicode(s.decode("iso-8859-1").encode('utf8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
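Both of the last two errors follow from the same round trip: decoding yields text, but encoding it again immediately turns it back into bytes, so normalize() never receives text. A minimal sketch, assuming Python 3 and a hypothetical sample value:

```python
# Hypothetical sample: "Müller" as ISO-8859-1 bytes (ü is the single byte 0xfc).
raw = b"M\xfcller"

text = raw.decode("iso-8859-1")    # text: this is what unicodedata.normalize() needs
utf8_bytes = text.encode("utf-8")  # bytes again: ü is now the two bytes 0xc3 0xbc

# The first byte of the UTF-8 encoding of ü is 0xc3, the byte named in the
# last error message: it is still not ASCII, so an implicit ASCII decode fails.
first_accent_byte = utf8_bytes[1]
```

In Python 2 terms, `s.decode("iso-8859-1")` alone already produces the unicode object; the trailing `.encode('utf8')` is what undoes it.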
Answer 0 (score: 1)
I think this should work:
def strip_accents(s):
    s = s.decode("cp1252")  # decode from cp1252 encoding instead of the implicit ascii encoding used by unicode()
    s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
    return s
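For readers on Python 3, the same idea can be sketched as follows. This is an adaptation, not the answer's code: the `encoding` parameter and the bytes/str check are assumptions added so that the function accepts either raw bytes or already-decoded text.

```python
import unicodedata


def strip_accents(value, encoding="cp1252"):
    """Return `value` as ASCII text with accents removed.

    `value` may be bytes (decoded with `encoding`, assumed cp1252 here)
    or text that has already been decoded.
    """
    if isinstance(value, bytes):
        value = value.decode(encoding)
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with "ignore" then drops the combining mark.
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")


from_bytes = strip_accents(b"M\xfcller")
from_text = strip_accents(u"Zo\u00eb")
no_decomposition = strip_accents(u"Stra\u00dfe")
```

Note that characters with no Unicode decomposition, such as ß, are dropped entirely rather than transliterated, so the last call returns "Strae".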
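Opening the file with the correct encoding does not work because DictReader does not seem to handle unicode strings correctly. The Python 2 csv documentation states that the module does not support Unicode input, so the decoded text is implicitly re-encoded as ASCII, which matches the UnicodeEncodeError shown in the question.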
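This limitation is specific to Python 2. As a point of comparison, here is a sketch of how the read step might look on Python 3, where the csv module works on text. The function name `read_rows`, the sample file contents, and the cp1252 default are all assumptions made for illustration.

```python
import csv
import io
import os
import tempfile


def read_rows(path, encoding="cp1252"):
    """Read a CSV file into a list of dicts, decoding it with `encoding`."""
    # newline="" lets the csv module handle line endings itself.
    with io.open(path, "r", encoding=encoding, newline="") as handle:
        return list(csv.DictReader(handle))


# Build a small hypothetical cp1252-encoded file to read back.
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as out:
    out.write(u"Name,City\r\nM\u00fcller,Z\u00fcrich\r\n".encode("cp1252"))

rows = read_rows(sample_path)
os.remove(sample_path)
```

On Python 2.7 the approach in the answer above (read bytes, then decode each value after DictReader has returned it) remains the simpler route.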
Answer 1 (score: 0)
Quoting @Duncan's answer to UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128):
print repr(ch)
Example:
string = 'Ka\u011f KO\u011e52 \u0131 \u0130\u00f6\u00d6 David \u00fc K\u00dc\u015f\u015e \u00e7 \u00c7'
print (repr(string))
It prints:
'Kağ KOĞ52 ı İöÖ David ü KÜşŞ ç Ç'
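When the goal is to see which bytes or code points a value actually holds, an escaped representation is more revealing than the rendered characters. A small sketch, assuming Python 3 and a hypothetical sample value; `ascii()` is the built-in that escapes every non-ASCII character, and `repr()` of a bytes object shows the raw bytes.

```python
name = u"M\u00fcller"  # hypothetical sample value

escaped_text = ascii(name)                 # code points, non-ASCII escaped
cp1252_repr = repr(name.encode("cp1252"))  # one byte for ü: 0xfc
utf8_repr = repr(name.encode("utf-8"))     # two bytes for ü: 0xc3 0xbc

print(escaped_text)
print(cp1252_repr)
print(utf8_repr)
```

Comparing the two byte representations makes it easy to tell whether a problem value came from a cp1252 or a UTF-8 source.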