I am trying to convert words that contain Turkish characters to lowercase.
The words are read from a UTF-8 encoded file.
When I convert to lowercase, the Turkish characters get broken. Converting to uppercase, however, works fine.
Here is the sample code:
with open(filepath, 'r', encoding='utf8') as f:
    text = f.read().lower()
Here is how it looks when it is broken:
What is happening here?
Some information that might be useful:
Answer 0 (score: 5)
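It is not broken.
Turkish has a dotted lowercase i and a dotless lowercase ı, and likewise a dotted uppercase İ and a dotless uppercase I.
This poses a challenge when converting the dotted uppercase İ to lowercase: how do you preserve the information that, if it is ever converted back to uppercase, it should become the dotted İ again?
Unicode solves this as follows: when İ is lowercased, it is actually converted to the standard Latin i plus the combining character U+0307 "COMBINING DOT ABOVE". What you are seeing is your terminal failing to render the combining character correctly (or, more to the point, failing to avoid rendering it), and it has nothing to do with Python.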
You can see this happening by using unicodedata.name():
>>> import unicodedata
>>> [unicodedata.name(c) for c in 'İ']
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
>>> [unicodedata.name(c) for c in 'İ'.lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
...although in a properly working, correctly configured terminal it renders without any problem:
>>> 'İ'.lower()
'i̇'
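If the terminal cannot be trusted to render combining characters, one way to inspect the result independently of rendering is to print code points instead of glyphs. The following is only a small sketch; `code_points` is a hypothetical helper name, not something from the standard library.

```python
def code_points(s):
    """Return the code points of s as 'U+XXXX' strings (hypothetical helper)."""
    return [f"U+{ord(c):04X}" for c in s]


dotted_upper = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

print(code_points(dotted_upper))          # one precomposed character
print(code_points(dotted_upper.lower()))  # base letter plus combining mark
print(ascii(dotted_upper.lower()))        # ASCII-only repr, safe in any terminal
```

Because the output is pure ASCII, it looks the same no matter how the terminal handles combining marks.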
As a side note, if you do convert it back to uppercase, it stays in decomposed form:
>>> [unicodedata.name(c) for c in 'İ'.lower().upper()]
['LATIN CAPITAL LETTER I', 'COMBINING DOT ABOVE']
...although you can recombine it with unicodedata.normalize():
>>> [unicodedata.name(c) for c in unicodedata.normalize('NFC','İ'.lower().upper())]
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
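If what you actually want is Turkish casing (İ to i, and I to ı), note that str.lower() is locale-independent and will not do that for you. One possible workaround, sketched below under the assumption that the whole text is Turkish, is to map the two special letters by hand before lowercasing. `turkish_lower` is a hypothetical helper name, not a standard library function.

```python
# Assumption: the input is Turkish text, so every 'I' should become dotless 'ı'.
# Applying this to English text would turn "I" into "ı", which is wrong there.
_TR_LOWER_MAP = str.maketrans({
    "\u0130": "i",   # İ (dotted capital)  -> i (plain Latin small i)
    "I": "\u0131",   # I (dotless capital) -> ı (dotless small i)
})


def turkish_lower(text):
    """Lowercase text using Turkish rules for the two i letters (hypothetical helper)."""
    return text.translate(_TR_LOWER_MAP).lower()


print(turkish_lower("\u0130STANBUL"))
print(turkish_lower("I\u015eIK"))
```

Mapping before calling lower() means the combining dot never gets produced, so the result contains only precomposed characters.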
For more details, see: