Question

The Python session below reports the wrong string length and the wrong characters.
Does anyone have an idea why?

>>> w ='lòng'
>>> w 
'lòng'
>>> print (w)
lòng
>>> len(w)
5
>>> for ch in w:
...     print (ch + "-") 
... 
l- 
o- 
- 
n- 
g- 
>>>

Answer 1

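The issue here is that in Unicode some characters may be composed of combinations of other characters. In this case, 'lòng' contains a lowercase 'o' and a grave accent as separate characters.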

>>> import unicodedata as ud
>>> w ='lòng'
>>> for c in w:
...     print(ud.name(c))
... 
LATIN SMALL LETTER L
LATIN SMALL LETTER O
COMBINING GRAVE ACCENT
LATIN SMALL LETTER N
LATIN SMALL LETTER G

This is a decomposed Unicode string, because the accented 'o' is decomposed into two characters. The unicodedata module provides the normalize function to convert between decomposed and composed forms:

>>> for c in ud.normalize('NFC', w):
...     print(ud.name(c))
... 
LATIN SMALL LETTER L
LATIN SMALL LETTER O WITH GRAVE
LATIN SMALL LETTER N
LATIN SMALL LETTER G

If you want to know whether a string is normalized to a particular form, without actually normalizing it, and you are on Python 3.8+, the more efficient unicodedata.is_normalized function can be used (credit to user Acumenus).
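
For reference, a minimal sketch of checking the normalization form without converting the string. It assumes Python 3.8+ and builds the decomposed string from an explicit escape (my choice, so the result does not depend on how an editor stored the accented letter):

```python
import unicodedata as ud

# 'o' followed by COMBINING GRAVE ACCENT, written with an escape so the
# decomposed form is unambiguous.
w = "lo\u0300ng"

# is_normalized only inspects the string; it does not build a new one.
print(ud.is_normalized("NFC", w))  # False: the string is not composed
print(ud.is_normalized("NFD", w))  # True: the string is already decomposed

# Composing merges 'o' and the accent into a single code point.
composed = ud.normalize("NFC", w)
print(len(w), len(composed))  # 5 4
```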

The Unicode HOWTO in the Python documentation includes a section on comparing strings, which discusses this in more detail.

Answer 2

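Unicode has a lot of flexibility in how it encodes characters. In this case, the ò is actually made up of 2 Unicode code points, one for the base character o and one for the accent mark. Unicode also has a single character that represents both at once, and it doesn't care which one you use. Python includes the unicodedata package, which can provide a consistent representation.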

>>> import unicodedata
>>> w ='lòng'
>>> len(w)
5
>>> len(unicodedata.normalize('NFC', w))
4
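
Because both spellings are valid, comparing them directly with == can fail. A small sketch of comparing after normalization follows; the helper name nfc_equal is made up for illustration and is not part of unicodedata:

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after converting both to composed (NFC) form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


decomposed = "lo\u0300ng"  # 'o' + COMBINING GRAVE ACCENT (5 code points)
composed = "l\u00f2ng"     # single LATIN SMALL LETTER O WITH GRAVE (4 code points)

print(decomposed == composed)           # False: the code points differ
print(nfc_equal(decomposed, composed))  # True: same text once normalized
```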

Answer 3

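The problem is that the len function and the in operator are broken w.r.t. Unicode.

As of now, there are two answers that claim normalization is the solution. Unfortunately, that is not true in general: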

>>> w = 'Ꙝ̛͋ᄀᄀᄀ각ᆨᆨ?❤️??'
>>> len(w)
19
>>> import unicodedata
>>> len(unicodedata.normalize('NFC', w))
19
>>> # 19 is still wrong

To handle this task correctly, you need to operate on graphemes:

>>> from grapheme import graphemes
>>> w = 'Ꙝ̛͋ᄀᄀᄀ각ᆨᆨ?❤️??'
>>> len(list(graphemes(w)))
3
>>> # 3 is correct
>>> for g in graphemes(w):
...     print(g)
Ꙝ̛͋
ᄀᄀᄀ각ᆨᆨ
?❤️??

This also works for your w = 'lòng' input: it is correctly segmented into 4 graphemes without any normalization.
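
If installing a third-party package is not an option, a rough standard-library approximation is to attach combining marks to the preceding base character. This is only a sketch: rough_clusters is a hypothetical helper, it is not full grapheme segmentation, and it will not handle cases such as Hangul jamo sequences or emoji joined with zero-width joiners.

```python
import unicodedata


def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified approximation only: it relies on the canonical combining
    class, so it covers accents like U+0300 but not every grapheme rule.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Non-zero combining class: attach to the previous cluster.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


w = "lo\u0300ng"  # decomposed form: 5 code points
parts = rough_clusters(w)
print(len(w))      # 5
print(len(parts))  # 4
```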

How to get characters from a string - getting the wrong characters and the wrong string length

3 answers: