Question

String slicing in Python 2.7 is very handy for extracting substrings. It works for ASCII characters, e.g.

>>> s = "Antonio"
>>> s[5:7]
'io'

but it fails when the string contains accented characters, e.g.

>>> s = "António"
>>> s[5:7]
'ni'

What is a safe way to get the correct substring, whether or not the original string contains accented characters?

Update: my configuration is as follows:

Python 2.7.11 (v2.7.11:6d1b6a68f775, Dec  5 2015, 12:54:16) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin

Thanks

Answer 1

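In Python 2.7, byte strings (str) and Unicode strings (unicode) are different types. A str holds bytes, and in UTF-8 the character ó takes two bytes, which is why len(s) is 8 in the session below and the slice indices are off by one. To declare a Unicode string literal, prefix it with u: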

Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "António"
>>> len(s)
8
>>> s2 = u"António"
>>> len(s2)
7
>>> s[5:7]
'ni'
>>> s2[5:7]
u'io'
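
When the text arrives as bytes (from a file, a socket, or a literal without the u prefix), decoding it before slicing should give the same result as the u literal above. This is only a minimal sketch, assuming the bytes are UTF-8 encoded; the variable names are illustrative and not part of the original answer.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes for "António"; the "ó" occupies two bytes (0xC3 0xB3).
raw = b"Ant\xc3\xb3nio"

# Slicing the byte string counts bytes, so the indices land one position early.
byte_slice = raw[5:7]       # the bytes "ni"

# Decode first (assumption: the data is UTF-8), then slice by character.
text = raw.decode("utf-8")
char_slice = text[5:7]      # the characters "io"
```

The same snippet runs on Python 3, where raw is a bytes object and text is a str.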

Answer 2

I finally found the answer to my problem. I just needed to read the text file like this:

import codecs
with codecs.open(ficheiro, encoding='utf-8') as fin:
    for line in fin:
        ...  # here line[5:7] works correctly for both "António" and "Antonio"
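
For comparison, io.open (available since Python 2.6, and the same function as the built-in open in Python 3) also accepts an encoding argument and yields Unicode lines. The following is a self-contained sketch, assuming a UTF-8 file; the temporary file is a hypothetical stand-in for ficheiro, and the names are placeholders.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample file standing in for `ficheiro`.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as fout:
    fout.write(u"Ant\xf3nio\nAntonio\n")

# With an explicit encoding, each line is decoded to Unicode on read,
# so slice indices count characters rather than bytes.
with io.open(path, encoding="utf-8") as fin:
    slices = [line[5:7] for line in fin]

os.remove(path)
```

Both lines yield the same two-character slice, because the accented and unaccented names now have the same length in characters.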

Thanks to Derek Dohler, who wrote "Solving Unicode Problems in Python 2.7".

String slicing does not work with accented characters
