Question

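I'm reading input that is mostly hex into a Python 3 script. However, the system is set to use UTF-8, and when piping from a Bash shell into the script I keep getting the following UnicodeDecodeError: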

UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

Following other SO answers, I'm reading the piped input in Python 3 with sys.stdin.read(), like this:

import sys
...
isPipe = 0
if not sys.stdin.isatty() :
    isPipe = 1
    try:
        inpipe = sys.stdin.read().strip()
    except UnicodeDecodeError as e:
        err_unicode(e)
...

This works when piping like this:

# echo "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
<output all ok!>

But it does not work with the raw format:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1"

    ▒▒▒
   ▒▒

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

I also tried other promising SO answers:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "open(1,'w').write(open(0).read())"
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "from io import open; open(1,'w').write(open(0).read())"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

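What I have learned so far is that when your terminal encounters the start of a UTF-8 sequence, it expects it to be followed by 1-3 more bytes, as described here:

UTF-8 is a variable-width character encoding capable of encoding all valid Unicode code points using one to four 8-bit bytes. So a leading byte (the first byte of a multi-byte UTF-8 character, in the range 0xC2 - 0xF4) must be followed by 1-3 continuation bytes, each in the range 0x80 - 0xBF.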
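To see why the very first byte already trips the decoder, here is a minimal sketch that applies the lead-byte and continuation-byte ranges quoted above to the sample input. The helper names expected_length and is_continuation are made up for illustration, and the check is deliberately simplified (it does not cover every rule a real UTF-8 decoder enforces).

```python
def expected_length(lead):
    """Return how many bytes a UTF-8 sequence starting with `lead` should span,
    or None if `lead` cannot start a sequence."""
    if lead <= 0x7F:
        return 1            # plain ASCII, no continuation bytes
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return None             # a continuation byte or an invalid lead byte


def is_continuation(byte):
    """True if `byte` lies in the continuation range 0x80 - 0xBF."""
    return 0x80 <= byte <= 0xBF


data = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"
n = expected_length(data[0])
print(n)
print([is_continuation(b) for b in data[1:n]])
```

The lead byte 0xED announces a three-byte sequence, but the following 0xFF bytes are outside the continuation range, which matches the "invalid continuation byte" message at position 0.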

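However, I can't always be sure where my input stream comes from, and it may well be raw data rather than the ASCII hex version above. So I need to handle this raw input somehow.

I have looked at a few alternatives, such as:

using codecs.decode
using open("myfile.jpg", "rb", buffering=0) with raw I/O
using bytes.decode(encoding="utf-8", errors="ignore") on a bytes object
or just using open(...)

But I don't know whether, or how, they can read a piped input stream the way I want.

How can I make my script also handle a raw byte stream?

PS. Yes, I have read loads of similar SO questions, but none of them adequately deal with this UTF-8 input error. The best one is this one.

This is not a duplicate.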

Answer 1

A simple approach is to read stdin in binary mode, as if it were a file.

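A minimal sketch of that idea, assuming the built-in open() is given the stdin file descriptor and binary mode. The function names read_fd_bytes and read_stdin_bytes are illustrative and not taken from the answer.

```python
import sys


def read_fd_bytes(fd):
    """Read everything from file descriptor `fd` as raw bytes, with no decoding."""
    # closefd=False leaves the descriptor itself open when the file object
    # is closed, so the caller (or sys.stdin) can keep using it.
    with open(fd, mode="rb", closefd=False) as binary_file:
        return binary_file.read()


def read_stdin_bytes():
    """Convenience wrapper: read all of stdin as bytes."""
    return read_fd_bytes(sys.stdin.fileno())
```

In a script, calling read_stdin_bytes() in place of sys.stdin.read() returns a bytes object, so no UnicodeDecodeError can be raised at read time.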

Answer 2

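I finally managed to work around this issue by not using sys.stdin!

Instead, I used with open(0, 'rb'), where:

0 is the file descriptor for stdin.
'rb' opens it for reading in binary mode.

This seems to circumvent the problem of the system trying to interpret locale characters in the pipe. I got the idea after seeing that the following worked and returned the correct (unprintable) characters: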

echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "with open(0, 'rb') as f: x=f.read(); import sys; sys.stdout.buffer.write(x);"

▒▒▒
   ▒▒

So, to correctly read any piped data, I used:

if not sys.stdin.isatty() :
    try:
        with open(0, 'rb') as f: 
            inpipe = f.read()

    except Exception as e:
        err_unknown(e)        
    # This can't happen in binary mode:
    #except UnicodeDecodeError as e:
    #    err_unicode(e)
...

This reads your piped data into a Python bytes object.
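Once the data is held as bytes, decoding becomes an explicit and optional step. The sketch below shows how the standard error handlers of bytes.decode treat the sample bytes from the question; the variable names are illustrative only.

```python
raw = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"

# "ignore" silently drops every byte that is not valid UTF-8 (data is lost).
ignored = raw.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# "surrogateescape" maps undecodable bytes to lone surrogates, so the
# original bytes can be recovered by encoding with the same handler.
escaped = raw.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(round_trip == raw)
```

Only the surrogateescape variant is reversible, so for genuinely binary input it is usually safer to keep working with the bytes object and skip decoding altogether.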

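The next problem is to determine whether the piped data comes from a character string (like echo "BADDATA0") or from a binary stream. The latter can be simulated with echo -ne "\xBA\xDD\xAT\xA0", as shown in the OP. In my case, I just use a regex to look for any bytes that are not ASCII hex digits or spaces.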

import re

if inpipe :
    # Match any run of bytes that are not ASCII hex digits or spaces.
    rx = re.compile(b'[^0-9a-fA-F ]+')
    r = rx.findall(inpipe.strip())
    if not r :
        print("is probably a HEX ASCII string")
    else:
        print("is something else, possibly binary")

Of course this could be done better and smarter. (Feel free to comment!)
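One possible refinement, sketched under the assumption that the textual form is always plain hex digits optionally separated by spaces (the "\xNN" escape form is not handled): let bytes.fromhex do the validation and the conversion in one step. The function name normalize_input is hypothetical.

```python
def normalize_input(inpipe):
    """Return the payload as bytes, whether it arrived as ASCII hex or raw."""
    try:
        # Non-ASCII bytes raise UnicodeDecodeError here.
        text = inpipe.decode("ascii").strip()
        # Non-hex characters or an odd number of digits raise ValueError.
        return bytes.fromhex(text)
    except (UnicodeDecodeError, ValueError):
        # Not a clean hex string: treat the input as raw binary.
        return inpipe


print(normalize_input(b"edffff0b 0400a0e1\n"))
print(normalize_input(b"\xed\xff\xff\x0b"))
```

The same ambiguity as with the regex remains: a raw stream that happens to consist only of hex-digit characters would be interpreted as a hex string.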

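Addendum (from https://my-deployed-instance.azurewebsites.net/)

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r', which means open for reading in text mode. In text mode, if encoding is not specified, the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes, use binary mode and leave encoding unspecified.) The default mode is 'r' (open for reading text, a synonym of 'rt'). For binary read-write access, the mode 'w+b' opens and truncates the file to 0 bytes. 'r+b' opens the file without truncation.

... Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return their contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having first been decoded using a platform-dependent encoding or using the specified encoding if one is given.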
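The difference between the two modes can be seen with a small self-contained sketch. A temporary file stands in for the pipe here, and the variable names are illustrative.

```python
import os
import tempfile

payload = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"

# Write the sample bytes to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Binary mode: the bytes come back untouched.
    with open(path, "rb") as f:
        binary_result = f.read()

    # Text mode with UTF-8: decoding happens during read() and fails.
    try:
        with open(path, "r", encoding="utf-8") as f:
            text_result = f.read()
    except UnicodeDecodeError as exc:
        text_result = exc
finally:
    os.remove(path)

print(type(binary_result).__name__, binary_result == payload)
print(type(text_result).__name__)
```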

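If closefd is False and a file descriptor rather than a filename was given, the underlying file descriptor will be kept open when the file is closed. If a filename is given, closefd must be True (the default); otherwise, an error will be raised.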
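A short sketch of both halves of that paragraph, using an os.pipe() descriptor as a stand-in for stdin (the variable names are illustrative):

```python
import os

read_fd, write_fd = os.pipe()
os.write(write_fd, b"\xed\xff")
os.close(write_fd)

# With closefd=False, closing the file object leaves the descriptor open.
with open(read_fd, "rb", closefd=False) as f:
    data = f.read()

still_open = True
try:
    os.fstat(read_fd)       # fails with OSError if the descriptor was closed
except OSError:
    still_open = False
os.close(read_fd)

# Passing a filename together with closefd=False is rejected.
try:
    open(os.devnull, "rb", closefd=False)
    name_error = None
except ValueError as exc:
    name_error = exc

print(data, still_open, type(name_error).__name__)
```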

Answer 3

Use sys.stdin.buffer.raw instead of sys.stdin
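A minimal sketch of that suggestion. The helper name read_raw and its optional stream parameter are illustrative; the parameter exists only so the function can be tried without a real pipe. sys.stdin.buffer is the binary layer beneath the text wrapper and .raw is the unbuffered object beneath that. This assumes sys.stdin is the standard text wrapper, which may not hold if it has been replaced (for example by an IDE).

```python
import io
import sys


def read_raw(stream=None):
    """Read all remaining bytes from the raw layer under a text stream."""
    text_stream = sys.stdin if stream is None else stream
    return text_stream.buffer.raw.read()


# Stand-in for a piped stdin: text wrapper -> buffered reader -> raw bytes.
fake_stdin = io.TextIOWrapper(io.BufferedReader(io.BytesIO(b"\xed\xff\xff\x0b")))
print(read_raw(fake_stdin))
```

Reading from sys.stdin.buffer directly (without .raw) also returns bytes and is usually sufficient.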

How to prevent "UnicodeDecodeError" when reading piped input from sys.stdin?

3 answers: