Question

Why does using replace here:

s = s.encode('ascii', 'replace')

give me this error?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 6755: ordinal not in range(128)

Isn't the whole point of 'replace' or 'ignore' to not fail when it can't decode a byte? Am I not understanding this?

(sorry I can't provide the actual string, the corpus is very large)

In any case, how do I tell Python to ignore or replace characters that aren't ASCII?

Answer 1

Note that you're getting a UnicodeDecodeError, not a UnicodeEncodeError.

That's because s.encode() takes a unicode string as input, but in this case you're not giving it one; you're giving it a bytestring instead.

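Thus, Python 2 first implicitly decodes the bytestring you're handing it to unicode (using the default 'ascii' codec in strict mode) before trying to encode it, and it's in that initial decode that the error occurs. The 'replace' argument only applies to the encode step, so it never gets a chance to help.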
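To make that hidden step visible, here is a small sketch that performs the decode explicitly. The sample bytes in raw are made up for illustration (the original corpus isn't available), and the snippet is written to run on Python 3, where the implicit decode no longer exists and has to be spelled out.

```python
# Hypothetical sample: three ASCII bytes followed by the non-ASCII byte 0xcb.
raw = b"caf\xcb"

# Python 2 effectively does this strict decode behind the scenes when you
# call .encode() on a bytestring.
message = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Giving the error handler to the decode step is what avoids the failure:
# the undecodable byte becomes U+FFFD, the Unicode replacement character.
replaced = raw.decode("ascii", "replace")
print(repr(replaced))
```

The printed message has the same shape as the one in the question, only with a different position.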

Decoding to unicode only to encode straight back to ASCII is a silly round trip, but if you really want to do it:

s_bytes = b'\xcb'  # a bytestring (the plain str type in Python 2, bytes in Python 3)
s_unicode = s_bytes.decode('ascii', 'replace')  # a unicode string now; 0xcb became U+FFFD
s_ascii = s_unicode.encode('ascii', 'replace')  # a bytestring again; U+FFFD became '?'
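If you need this in more than one place, the same idea can be wrapped in a helper. This is only a sketch: to_ascii is a hypothetical name, not a standard library function, and it assumes you really do want to throw away everything outside ASCII rather than decode the input with its real encoding.

```python
def to_ascii(data, errors="replace"):
    """Return ASCII-only bytes from either a bytestring or a unicode string.

    to_ascii is a hypothetical helper name. errors is passed to both codec
    calls, so it can be 'replace' or 'ignore'.
    """
    if isinstance(data, bytes):
        # Decode explicitly so the error handler applies here too, instead
        # of relying on an implicit strict decode.
        data = data.decode("ascii", errors)
    return data.encode("ascii", errors)


print(to_ascii(b"caf\xcb"))            # non-ASCII byte replaced with '?'
print(to_ascii(b"caf\xcb", "ignore"))  # non-ASCII byte dropped
print(to_ascii(u"caf\xe9"))            # works for unicode input as well
```

With 'replace', each offending byte or character ends up as a '?' in the result; with 'ignore', it is simply removed.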
