Question

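Although Python 3.x fixed upper- and lower-casing for some locales (for example tr_TR.utf8), the Python 2.x branch lacks this. There are several workarounds for the problem, such as https://github.com/emre/unicode_tr/, but I don't like that kind of solution.

So I am implementing new upper / lower / capitalize / title methods to monkey-patch the unicode class, using the string.maketrans method.

The problem with maketrans is that its two strings must have the same length. The closest approach I could think of is: "How can I convert a 1-byte character into 2 bytes?"

Note: the translate method seems to work only with ASCII; when I pass u'İ' (a single code point, \u0130) as an argument to translate, I get an ASCII encoding error.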

from string import maketrans
import unicodedata

c1 = unicodedata.normalize('NFKD', u'i').encode('utf-8')
c2 = unicodedata.normalize('NFKD', u'İ').encode('utf-8')

# c1, len(c1)
# ('i', 1)

# c2, len(c2)
# ('I\xcc\x87', 3)   NFKD splits İ into I plus a combining dot above

'istanbul'.translate(maketrans(c1, c2))
# ValueError: maketrans arguments must have same length

Answer 1

Unicode objects allow multi-character translation through a dictionary, instead of through two byte strings mapped with maketrans.

#!python2
#coding:utf8
D = {ord(u'i'):u'İ'}
print u'istanbul'.translate(D)

Output:

İstanbul

If you start with an ASCII byte string and want the result in UTF-8, simply decode before the translation and encode after it:

#!python2
#coding:utf8
D = {ord(u'i'):u'İ'}
s = 'istanbul'.decode('ascii')
t = s.translate(D)
s = t.encode('utf8')
print repr(s)

Output:

'\xc4\xb0stanbul'
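
Extending the single-entry dictionary above, the Turkish-aware casing the question asks for could be sketched roughly as follows. This is only a minimal sketch: the names TR_UPPER, TR_LOWER, tr_upper, tr_lower and tr_capitalize are hypothetical, and it assumes that only the four i-like letters need special treatment before the ordinary upper() / lower() call. It uses escapes instead of literal characters so that it should run unchanged on Python 2 (unicode objects) and Python 3 (str).

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers; none of these names come from the original post.

# Only the letters whose case pairs differ in Turkish are mapped.
# Keys are Unicode ordinals, values are Unicode strings.
TR_UPPER = {
    ord(u'i'): u'\u0130',       # i -> capital I with dot above
    ord(u'\u0131'): u'I',       # dotless i -> I
}
TR_LOWER = {
    ord(u'I'): u'\u0131',       # I -> dotless i
    ord(u'\u0130'): u'i',       # capital I with dot above -> i
}


def tr_upper(text):
    """Upper-case a unicode string, applying the Turkish i mappings first."""
    return text.translate(TR_UPPER).upper()


def tr_lower(text):
    """Lower-case a unicode string, applying the Turkish I mappings first."""
    return text.translate(TR_LOWER).lower()


def tr_capitalize(text):
    """Upper-case the first character and lower-case the rest."""
    return tr_upper(text[:1]) + tr_lower(text[1:])
```

Translating before calling upper() or lower() means the built-in methods never see the four special letters, so their locale-independent behaviour does not interfere. These are plain functions rather than patched methods, since the built-in unicode type does not accept new attributes.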

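The following technique does the job of maketrans. Note that the dictionary keys must be Unicode ordinals, but the values may be Unicode ordinals, Unicode strings, or None. If a value is None, that character is deleted during translation.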

#!python2
#coding:utf8
def maketrans(a,b):
    return dict(zip(map(ord,a),b))
D = maketrans(u'àáâãäå',u'ÀÁÂÃÄÅ')
print u'àbácâdãeäfåg'.translate(D)

Output:

ÀbÁcÂdÃeÄfÅg
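
The maketrans replacement above only produces string values. As a small illustration of the other value kinds the translation dictionary accepts, here is a sketch with one entry of each kind; the name table and the sample input are made up for the example, and the code is written so that it should behave the same on Python 2 unicode objects and Python 3 str.

```python
# -*- coding: utf-8 -*-
# Illustrative table: keys are always Unicode ordinals.
table = {
    ord(u'a'): ord(u'A'),   # ordinal value: a -> A
    ord(u'i'): u'\u0130',   # string value: i -> capital I with dot above
    ord(u's'): u'ss',       # a string value may be longer than one character
    ord(u'-'): None,        # None: the character is deleted
}

result = u'i-s-a'.translate(table)
# result == u'\u0130ssA'
```

Because a value can be a string of any length, the equal-length restriction of string.maketrans does not apply here.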

Reference: str.translate

How to convert an 8-bit ASCII string to 16-bit Unicode in Python
