Why does this Turkish character get corrupted when I lowercase it?

Date: 2017-03-19 14:07:00

Tags: python utf-8 character-encoding case-sensitive turkish

I am trying to convert some words that contain Turkish characters to lowercase.

The words are read from a UTF-8 encoded file.

When I convert to lowercase, a Turkish character gets corrupted. Converting to uppercase, however, works fine.

Here is the example code:

with open(filepath, 'r', encoding='utf8') as f:
    text = f.read().lower()

This is how it looks when it is corrupted:

[Image: how the text is seen when it is corrupted]

What is happening here?

Some information that may be useful:

  • I am using the Windows 10 cmd prompt
  • Python version 3.6.0
  • chcp is set to 65001

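1 Answer:

Answer 0: (score: 5)

It is not being corrupted.

Turkish has both a dotted lowercase i and a dotless lowercase ı, and likewise a dotted uppercase İ and a dotless uppercase I.

This poses a challenge when the dotted uppercase İ is converted to lowercase: how do you retain the information that, if it is converted back to uppercase, it should become the dotted İ again?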

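Unicode solves this as follows: when İ is converted to lowercase, it is actually converted to the standard Latin i plus the combining character U+0307 "COMBINING DOT ABOVE". What you are seeing is your terminal's inability to properly render (or, more to the point, to refrain from rendering) the combining character. It has nothing to do with Python.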

You can see this happening by using unicodedata.name():

>>> import unicodedata
>>> [unicodedata.name(c) for c in 'İ']
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
>>> [unicodedata.name(c) for c in 'İ'.lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

...although, in a properly working and correctly configured terminal, it is rendered without any problems:

>>> 'İ'.lower()
'i̇'
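
If the terminal's rendering is in doubt, one way to confirm that the string itself is intact is to look at its escaped form and its length rather than at the glyphs. A minimal sketch (the variable name `lowered` is only illustrative):

```python
lowered = '\u0130'.lower()  # '\u0130' is the dotted capital İ

# ascii() escapes every non-ASCII character, so what is printed does not
# depend on how the terminal draws combining characters.
print(ascii(lowered))  # 'i\u0307'
print(len(lowered))    # 2: 'i' followed by U+0307
```

If this prints the escape sequence and a length of 2, the data is fine and only the display is off.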

As a side note, if you do convert it back to uppercase, it remains in the decomposed form:

>>> [unicodedata.name(c) for c in 'İ'.lower().upper()]
['LATIN CAPITAL LETTER I', 'COMBINING DOT ABOVE']

...although you can recombine it with unicodedata.normalize():

>>> [unicodedata.name(c) for c in unicodedata.normalize('NFC','İ'.lower().upper())]
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
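
If what you actually want is Turkish-style lowercasing (İ to i, and I to ı) with no combining dot left over, str.lower() will not do that by itself, because it applies the same locale-independent mapping to all text. Below is a minimal sketch of one possible workaround, assuming the input is known to be Turkish. `turkish_lower` is a hypothetical helper name, not something from the standard library:

```python
def turkish_lower(text):
    """Lowercase text using the Turkish dotted/dotless i rules."""
    # Handle the two Turkish-specific capitals by hand before calling
    # str.lower(), which would otherwise turn İ into 'i' + U+0307 and
    # I into a plain dotted 'i'.
    return text.replace('\u0130', 'i').replace('I', '\u0131').lower()


print(turkish_lower('\u0130STANBUL'))  # input is 'İSTANBUL'
print(turkish_lower('ISPARTA'))
```

This mapping would be wrong for non-Turkish text, where I should lowercase to a dotted i, so it only makes sense when the language of the input is known.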

For more information, see: