Question

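I think I'm fundamentally confused about character sets that aren't ASCII.

I have a Python file that I declared at the top to be # -*- coding: cp1252 -*-.

For example, in the file I have question = "what is your borther’s name".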

type(question)

>> str

question

>> 'what is your borther\xe2\x80\x99s name'

At this point I can't convert it to unicode, presumably because you can't go from ASCII to Unicode.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 20: ordinal not in range(128)
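
For reference, the traceback points at an implicit ASCII decode rather than at any limit on converting byte strings: in Python 2, unicode(some_str) without an encoding argument decodes as ASCII. The bytes \xe2\x80\x99 are the UTF-8 encoding of U+2019, which suggests the file was actually saved as UTF-8 and not cp1252. A minimal sketch of the two decodes, assuming that byte sequence (the variable names are illustrative only):

```python
# Byte string as shown in the interpreter output above.
raw = b"what is your borther\xe2\x80\x99s name"

# Decoding as ASCII fails at the first byte >= 0x80, which sits at index 20.
try:
    raw.decode("ascii")
    ascii_error_position = None
except UnicodeDecodeError as exc:
    ascii_error_position = exc.start

# Naming the encoding the bytes are really in succeeds and yields
# a single U+2019 character in place of the three bytes.
text = raw.decode("utf-8")
```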

If I declare it as unicode to begin with:

question = u"what is your borther’s name"

>> u'what is your borther\u2019s name'

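How do I get "what is your borther’s name" back? Or is that just how the Python interpreter displays unicode strings, and will it actually be encoded correctly when I pass it to a unicode-aware application (in this case Office)?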
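
To make the display question concrete: an escaped form such as \u2019 is only how the string is represented, while the string itself holds one character at that position. A small sketch, assuming the unicode string above, that inspects the character and the bytes produced by two encodings:

```python
text = u"what is your borther\u2019s name"

# The string contains one real character at index 20, not six ASCII ones.
apostrophe = text[20]

# The escaped spelling can be produced on demand, but it is only a representation.
escaped = text.encode("unicode_escape")

# The bytes handed to another program depend on the encoding chosen at output time.
utf8_bytes = text.encode("utf-8")      # U+2019 becomes \xe2\x80\x99
cp1252_bytes = text.encode("cp1252")   # U+2019 becomes \x92
```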

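I need to preserve the special characters, but I still need to do string comparisons using the Levenshtein library (pip install python-Levenshtein).

Levenshtein.ratio takes either str or unicode for both of its arguments, but not a mix of the two.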
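
One way to avoid the mixed-type problem is to normalize both arguments to unicode before comparing. The sketch below assumes byte strings are UTF-8; to_text and ratio are hypothetical helper names, and difflib is used only as a stand-in (a different similarity metric) when python-Levenshtein is not installed:

```python
import difflib

try:
    import Levenshtein  # provided by python-Levenshtein, if installed
except ImportError:
    Levenshtein = None


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def ratio(a, b):
    """Compare two strings after converting both to text."""
    a, b = to_text(a), to_text(b)
    if Levenshtein is not None:
        return Levenshtein.ratio(a, b)
    # Stand-in only: SequenceMatcher is not Levenshtein distance.
    return difflib.SequenceMatcher(None, a, b).ratio()


same = ratio(b"what is your borther\xe2\x80\x99s name",
             u"what is your borther\u2019s name")
```

Because both inputs are decoded to the same text before comparison, the byte string and the unicode string above compare as identical.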

Answer 1

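I have a plain text file that I declared at the top to be # -*- coding: cp1252 -*-.

This has no effect on a plain text file; a coding declaration only tells the interpreter how to decode Python source code. Specify the encoding when you open the file instead: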

import codecs

# Decode the file's bytes as cp1252 while reading.
with codecs.open(..., encoding='cp1252') as fp:
   ...
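
A fuller sketch of the same idea, written against a temporary file so it can be run as is. It uses io.open, which accepts the same encoding argument, in place of codecs.open; the file contents and variable names are assumptions made for illustration:

```python
import io
import os
import tempfile

# Create a scratch file containing cp1252 bytes (0x92 is the curly apostrophe).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "wb") as fp:
        fp.write(b"what is your borther\x92s name")

    # Passing the encoding makes read() return decoded text instead of raw bytes.
    with io.open(path, encoding="cp1252") as fp:
        question = fp.read()
finally:
    os.remove(path)
```

After this, question is a unicode string containing U+2019, so it can be compared with other unicode strings directly.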