Question

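In my HTML file, the word "Schilderung" looks fine; there does not seem to be any (encoding?) problem with it. But when I copy the word, I get the following: "Schilde-rung", and if I check the length with Python, I get 13 (instead of 12 ...).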

What is the problem here, and how can I deal with it?

Thanks a lot for your help!

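Edit: At the moment I use the following: output.write(text.decode("utf-8")). This handles all umlauts and other special characters correctly, but the problem above remains. print(repr(txt)) gives: Schilde\xc2\xadrung. How can we solve this? Thanks a lot!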

Answer 1

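There is a U+00AD SOFT HYPHEN before the "r" in the string. It is normally invisible, and in UTF-8 it is encoded as the two bytes \xc2\xad: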

>>> "Schilde\xc2\xadrung".decode('utf-8')  # Python 2 byte string, as shown by repr() in the question
u'Schilde\xadrung'

To remove the non-ASCII characters (note that this also drops umlauts and every other non-ASCII character):

>>> s = u'Schilde\xadrung'
>>> s.encode('ascii', 'ignore').decode()
u'Schilderung'
>>> len(_)
11
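
Since the question mentions that umlauts must survive, dropping everything outside ASCII may be too aggressive. A minimal sketch of a narrower approach, assuming Python 3 and assuming the only unwanted character is the soft hyphen, could look like this. The helper name `strip_soft_hyphens` is hypothetical and not part of any library:

```python
# Sketch (Python 3): remove only U+00AD and keep all other characters.
SOFT_HYPHEN = "\u00ad"


def strip_soft_hyphens(text):
    """Return text with every U+00AD SOFT HYPHEN removed (hypothetical helper)."""
    return text.replace(SOFT_HYPHEN, "")


raw = b"Schilde\xc2\xadrung"       # the UTF-8 bytes shown by repr() in the question
word = raw.decode("utf-8")         # decoded text, still containing the soft hyphen
cleaned = strip_soft_hyphens(word)

print(len(raw), len(word), len(cleaned))  # 13 12 11
```

The three lengths differ because the soft hyphen takes two bytes in UTF-8 but counts as one character after decoding. Umlauts are left untouched by this approach.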

Answer 2

It seems there is a non-ASCII character in front of the "r":

>>> u'Schilderung'  # pasted from the page; an invisible U+00AD sits before the "r"
u'Schilde\xadrung'
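
To find out which invisible character is hiding in a string, the standard `unicodedata` module can report its name. The following is a small sketch, assuming Python 3; the function name `describe_non_ascii` is made up for this illustration:

```python
import unicodedata


def describe_non_ascii(text):
    """List (index, code point, Unicode name) for each non-ASCII character (hypothetical helper)."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


print(describe_non_ascii("Schilde\u00adrung"))
# [(7, 'U+00AD', 'SOFT HYPHEN')]
```

Index 7 is the position right after "Schilde", which matches where the copied word appears to break.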