Question

I have a piece of code in a function that replaces badly encoded foreign characters using string replacement:

odbc_str = "String from an old database with weird mixed encodings"
s = str(bytes(odbc_str.strip(), 'cp1252'))
s = s.replace('\\x82', 'é')
s = s.replace('\\x8a', 'è')
(...)
print(s)
# b"String from an old database with weird mixed encodings"

I need a "real" string, not bytes. But when I try to decode it, I get an exception:

odbc_str = "String from an old database with weird mixed encodings"
s = str(bytes(odbc_str.strip(), 'cp1252'))
s = s.replace('\\x82', 'é')
s = s.replace('\\x8a', 'è')
(...)
print(s.decode("utf-8"))
# AttributeError: 'str' object has no attribute 'decode'

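Do you know why s appears as bytes?
Why can't I decode it to a real string?
Do you know a clean way to do this? (For now I return s[2:][:-1]. It works, but it is very ugly, and I would like to understand this behavior.)

Thanks in advance!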

EDIT:

In Python 3, pypyodbc uses Unicode everywhere by default. That is what confused me. When connecting, you can tell it to use ANSI:

con_odbc = pypyodbc.connect("DSN=GP", False, False, 0, False)

Then I can convert what is returned from cp850, which is the original code page of the database:

str(odbc_str, "cp850", "replace")

There is no longer any need to replace each special character by hand. Many thanks to pepr.

Answer 1

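The printed b"String from an old database with weird mixed encodings" is not a representation of the string content. It is the value of the string content, because you did not pass the encoding argument to str() ... (see the documentation: https://docs.python.org/3.4/library/stdtypes.html#str)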

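If neither encoding nor errors is given, str(object) returns object.__str__(), which is the "informal" or nicely printable string representation of object. For string objects, this is the string itself. If object does not have a __str__() method, then str() falls back to returning repr(object).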

This is what happened in your case. The b" is actually two characters that are part of the string content. You can also try:

s1 = 'String from an old database with weird mixed encodings'
print(type(s1), repr(s1))
by = bytes(s1, 'cp1252')
print(type(by), repr(by))
s2 = str(by)
print(type(s2), repr(s2))

which prints:

<class 'str'> 'String from an old database with weird mixed encodings'
<class 'bytes'> b'String from an old database with weird mixed encodings'
<class 'str'> "b'String from an old database with weird mixed encodings'"

This is why s[2:][:-1] works for you.
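To make the difference concrete, here is a minimal sketch comparing str() with and without the encoding argument. The sample text 'café' and the variable names are just illustrative assumptions, not taken from your data:

```python
original = 'café'
raw = original.encode('cp1252')       # bytes object: b'caf\xe9'

# Without an encoding, str() falls back to repr() of the bytes object,
# so the leading b' and the trailing ' become part of the string.
as_repr = str(raw)

# With an encoding, str() really decodes the bytes.
decoded = str(raw, 'cp1252')
also_decoded = raw.decode('cp1252')   # equivalent spelling

print(repr(as_repr))
print(repr(decoded))
```

So rather than slicing off the first two characters and the last one, it is probably cleaner to decode the bytes directly and never call str() on them without an encoding.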

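If you think about it some more, then (in my opinion) you want one of two things. Either you get bytes or bytearray from the database (if possible) and fix the bytes (see bytes.translate: https://docs.python.org/3.4/library/stdtypes.html?highlight=translate#bytes.translate), or you successfully get a string (luckily, no exception was raised when that string was constructed) and you want to replace the wrong characters with the correct ones (see also str.translate()).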
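A rough sketch of both approaches, assuming the two replacements from the question (\x82 should be é, \x8a should be è). The sample data and the choice of cp1252 as the target encoding are assumptions made only for illustration:

```python
# Variant 1: fix the bytes first, then decode.
# bytes.maketrans builds a 256-byte table mapping each byte of the first
# argument to the byte at the same position in the second argument.
byte_table = bytes.maketrans(b'\x82\x8a', b'\xe9\xe8')  # target: cp1252 codes of é, è
raw = b'caf\x82 cr\x8ame'                               # hypothetical bytes from the database
fixed_from_bytes = raw.translate(byte_table).decode('cp1252')

# Variant 2: the data already arrived as str with wrong characters in it.
# str.maketrans accepts a dict of single characters to replacement strings.
char_table = str.maketrans({'\x82': 'é', '\x8a': 'è'})
broken = 'caf\x82 cr\x8ame'                             # hypothetical broken string
fixed_from_str = broken.translate(char_table)

print(fixed_from_bytes)
print(fixed_from_str)
```

Either way, a single translate() call replaces the whole chain of replace() calls.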

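Possibly ODBC uses the wrong encoding internally. (That is, the content of the database may be correct, but it is misinterpreted by ODBC, and you have no way to tell ODBC what the correct encoding is.) In that case you want to encode the string back to bytes using that wrong encoding, and then decode the bytes using the right encoding.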
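That round trip could look like the sketch below. It assumes the data was really stored as cp850 and was misread as cp1252; both encoding names are assumptions here, so substitute whatever applies to your setup:

```python
right_encoding = 'cp850'    # assumed: how the database actually stores the text
wrong_encoding = 'cp1252'   # assumed: what the driver used to decode it

stored = 'café crème'.encode(right_encoding)   # the bytes as stored in the database
misread = stored.decode(wrong_encoding)        # what the application receives

# Undo the wrong decoding, then decode again with the right encoding.
repaired = misread.encode(wrong_encoding).decode(right_encoding)

print(repr(misread))
print(repr(repaired))
```

Note that this only works when the wrong decoding did not lose information. If the driver already replaced some bytes (for instance with ?), those characters cannot be recovered this way.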

Python 3.4: str: AttributeError: 'str' object has no attribute 'decode'
