Why does using replace here:
s = s.encode('ascii', 'replace')
give me this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 6755: ordinal not in range(128)
Isn't the whole point of 'replace' or 'ignore' to not fail when it can't decode a byte? Am I not understanding this?
(Sorry, I can't provide the actual string; the corpus is very large.)
In any case, how do I tell Python to ignore or replace characters that aren't ASCII?
Answer 0 (score: 3)
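Note that you're getting a UnicodeDecodeError, not a UnicodeEncodeError.
That's because s.encode() takes a unicode string as input, but in this case you're not giving it one; you're giving it a bytestring instead.
Thus, Python 2 implicitly decodes the bytestring you're handing it to unicode before trying to encode it, and it's in that initial decode that the error occurs. The implicit decode uses the default codec (ASCII, unless it has been changed) with strict error handling, so the 'replace' you passed never comes into play: it only applies to the encode step, which is never reached.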
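As a rough illustration of that hidden step, the sketch below performs the strict ASCII decode explicitly. The sample bytes and the variable names are made up for the example; the real string from the question isn't available.

```python
# Hypothetical sample data: three ASCII bytes followed by the non-ASCII byte 0xcb.
raw = b"abc\xcb"

try:
    # This is roughly what Python 2 does behind the scenes when .encode()
    # is called on a bytestring: a strict decode with the ASCII codec.
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records which codec failed and where in the input.
print(error)
```

The failure is reported against the decode, which matches the UnicodeDecodeError in the question.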
Decoding to unicode only to encode straight back to ASCII is a silly round-trip, but if you really wanted to do it:
s_bytes = b'\xcb'  # a bytestring: plain str in Python 2, bytes in Python 3
s_unicode = s_bytes.decode('ascii', 'replace')  # a unicode string now; the bad byte becomes U+FFFD
s_ascii = s_unicode.encode('ascii', 'replace')  # a bytestring again; U+FFFD becomes '?'
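To answer the last part of the question, here is a minimal sketch that wraps the two steps in a helper and compares 'replace' with 'ignore'. The name to_ascii_bytes and the sample input are hypothetical, chosen only for this example.

```python
def to_ascii_bytes(raw, errors="replace"):
    """Hypothetical helper: return raw with non-ASCII bytes replaced or dropped."""
    # Pass the error handler to the decode step, where the failure actually occurs.
    text = raw.decode("ascii", errors)
    # Encode back to bytes; with "replace", U+FFFD cannot be encoded and becomes "?".
    return text.encode("ascii", errors)


sample = b"caf\xcb!"  # made-up input containing one non-ASCII byte

replaced = to_ascii_bytes(sample, "replace")  # b"caf?!"
ignored = to_ascii_bytes(sample, "ignore")    # b"caf!"
```

If a unicode string is what you need downstream, the decode step alone is enough and the encode can be skipped.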