Question

Hey, I'm trying to port this small snippet from Python 2 to Python 3.

Python 2:

def _download_database(self, url):
  try:
    with closing(urllib.urlopen(url)) as u:
      return StringIO(u.read())
  except IOError:
    self.__show_exception(sys.exc_info())
  return None

Python 3:

def _download_database(self, url):
  try:
    with closing(urllib.request.urlopen(url)) as u:
      response = u.read().decode('utf-8')
      return StringIO(response)
  except IOError:
    self.__show_exception(sys.exc_info())
  return None

But I still get:

utf-8 codec can't decode byte 0x8f in position 12: invalid start byte

I need to use StringIO because the download is a zip file, and I want to parse it with this function:

def _parse_zip(self, raw_zip):
  try:
     zip = zipfile.ZipFile(raw_zip)

     filelist = map(lambda x: x.filename, zip.filelist)
     db_file  = 'IpToCountry.csv' if 'IpToCountry.csv' in filelist else filelist[0]

     with closing(StringIO(zip.read(db_file))) as raw_database:
        return_val = self.___parse_database(raw_database)

     if return_val:
        self._load_data()

  except:
     self.__show_exception(sys.exc_info())
     return_val = False

  return return_val

raw_zip is the return value of the _download_database function.

Answer 1

utf-8 cannot decode arbitrary binary data.

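utf-8 is a character encoding. It can be used to encode text (represented in Python 3 by the str type: a sequence of Unicode code points) into a bytestring (the bytes type: a sequence of bytes, that is, small integers in the range [0, 255]) and to decode it back.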
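As a minimal sketch of that round trip (the sample string is made up for illustration), encoding turns str into bytes and decoding with the same encoding turns it back:

```python
# Sample text chosen for illustration; \u00ef and \u00e9 are non-ASCII code points
text = "na\u00efve caf\u00e9"   # str: a sequence of Unicode code points
data = text.encode("utf-8")     # bytes: the utf-8 encoding of that text

print(type(text).__name__, len(text))  # length counted in code points
print(type(data).__name__, len(data))  # length counted in bytes
print(list(data[:4]))                  # each item is a small integer in [0, 255]

# Decoding with the same encoding recovers the original text
round_trip = data.decode("utf-8")
print(round_trip == text)
```

The two lengths differ because each non-ASCII character here takes two bytes in utf-8.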

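utf-8 is not the only character encoding, and some character encodings are incompatible with it. Even if .decode('utf-8') does not raise an exception, that does not mean the result is correct: if you decode text with the wrong character encoding, you may get mojibake. See "A good way to get the charset/encoding of an HTTP response in Python".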
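A small illustration of mojibake, using a made-up sample string: the bytes are valid in both encodings, so decoding with the wrong one succeeds silently and produces the wrong text.

```python
text = "caf\u00e9"              # sample text, chosen for illustration
data = text.encode("utf-8")     # the text was encoded as utf-8

# Decoding with a different encoding raises no exception here,
# but the result is not the original text
wrong = data.decode("latin-1")

print(repr(text))
print(repr(wrong))
print(wrong == text)
```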

Your input is a zip file. That is binary data, not text, so you should not try to decode it as text.
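The error from the question can be reproduced without any download. In the sketch below the bytes are made up: a zip local-file-header signature, some zero padding, and then a 0x8f byte at index 12. They are not taken from the real file.

```python
# Made-up bytes for illustration only: "PK\x03\x04" is the zip signature,
# followed by 8 padding bytes, so the 0x8f byte sits at index 12
fake_zip_start = b"PK\x03\x04" + b"\x00" * 8 + b"\x8f"

error = None
try:
    fake_zip_start.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x8f can only appear as a continuation byte in utf-8,
    # so it is rejected when found where a character should start
    error = exc
    print(exc)
```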

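Python 3 helps you find errors caused by mixing binary data and text. To port code from Python 2 to Python 3, you should understand the difference between text (Unicode) and binary data (bytes).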

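On Python 2, str is a bytestring that can be used for both binary data and (encoded) text. Unless from __future__ import unicode_literals is present, a '' literal creates a bytestring in Python 2, and u'' creates a unicode instance. On Python 3, the str type is Unicode. bytes refers to a sequence of bytes on both Python 3 and Python 2.7 (bytes is an alias for str on Python 2). b'' creates a bytes instance on Python 2/3.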
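A short Python 3 sketch of that separation: str and bytes are distinct types, they never compare equal, and combining them raises TypeError instead of converting silently.

```python
text = "abc"     # str: Unicode text
data = b"abc"    # bytes: binary data

print(type(text).__name__)
print(type(data).__name__)
print(text == data)  # text and bytes do not compare equal in Python 3

mixing_error = None
try:
    text + data      # mixing the two types is an error, not an implicit conversion
except TypeError as exc:
    mixing_error = exc
    print("TypeError:", exc)

# Conversion has to be explicit, with a named encoding
print(data.decode("ascii") == text)
print(text.encode("ascii") == data)
```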

urllib.request.urlopen(url) returns a file-like object (a binary file) that you can pass along as is, for example to decode remote gzipped content on the fly:

#!/usr/bin/env python3
import xml.etree.ElementTree as etree
from gzip import GzipFile
from urllib.request import urlopen, Request

with urlopen(Request("http://smarkets.s3.amazonaws.com/oddsfeed.xml",
                     headers={"Accept-Encoding": "gzip"})) as response, \
     GzipFile(fileobj=response) as xml_file:
    # getelements() and process() are placeholders that are not defined in this snippet
    for elem in getelements(xml_file, 'interesting_tag'):
        process(elem)

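ZipFile() requires a seek()-able file, so you cannot pass the urlopen() response to it directly. You have to download the content first. You can use io.BytesIO() to wrap it: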

#!/usr/bin/env python3
import io
import zipfile
from urllib.request import urlopen

url = "http://www.pythonchallenge.com/pc/def/channel.zip"
with urlopen(url) as r, zipfile.ZipFile(io.BytesIO(r.read())) as archive:
    print({member.filename: archive.read(member) for member in archive.infolist()})

StringIO() is a text file. It stores Unicode in Python 3.
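The difference can be checked offline. The sketch below builds a tiny zip archive in memory (the member name and its content are made up for illustration), then shows that io.StringIO() refuses bytes while io.BytesIO() gives ZipFile() the seekable binary file it needs.

```python
import io
import zipfile

# Build a small zip archive in memory; name and content are illustrative only
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("IpToCountry.csv", "1,2,ZZ\n")
raw = buffer.getvalue()  # the archive as bytes, like the result of response.read()

stringio_error = None
try:
    io.StringIO(raw)     # a text buffer cannot be initialised with bytes
except TypeError as exc:
    stringio_error = exc
    print("TypeError:", exc)

# BytesIO wraps the same bytes in a seekable binary file object
with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    names = archive.namelist()
    content = archive.read("IpToCountry.csv")  # bytes, not str

print(names)
print(content)
```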

Answer 2

If all you want is to return a stream handle from your function (without needing to decode the content), you can use BytesIO instead of StringIO:

from contextlib import closing
from io import BytesIO
from urllib.request import urlopen

url = 'http://www.google.com'


with closing(urlopen(url)) as u:
    response = u.read()
    print(BytesIO(response))
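Applied to the code in the question, this might look like the sketch below. The standalone function names are hypothetical stand-ins for the asker's methods, and the utf-8 encoding of the CSV member is an assumption, not something the thread confirms. Note also that map() returns an iterator in Python 3, so the original filelist[0] would fail; namelist() returns a real list.

```python
import io
import zipfile
from urllib.request import urlopen


def download_database(url):
    """Return the downloaded bytes wrapped in a seekable BytesIO, or None on error."""
    try:
        with urlopen(url) as response:
            return io.BytesIO(response.read())  # no .decode(): zip data is binary
    except OSError as exc:  # IOError is an alias of OSError in Python 3
        print("download failed:", exc)
        return None


def read_database_text(raw_zip, preferred="IpToCountry.csv", encoding="utf-8"):
    """Return a StringIO over the preferred member, or over the first member otherwise."""
    with zipfile.ZipFile(raw_zip) as archive:
        names = archive.namelist()  # a real list, unlike map() in Python 3
        db_file = preferred if preferred in names else names[0]
        # archive.read() returns bytes; only the extracted member is decoded,
        # and the encoding used here is an assumption about the CSV
        return io.StringIO(archive.read(db_file).decode(encoding))
```

The archive itself stays binary from download to ZipFile(); decoding happens only once a text member has been extracted.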

Answer 3

The link you posted, http://software77.net/geo-ip?DL=2, is trying to download a zip file, which is a binary file.

You should not convert a binary blob to str (just use BytesIO).
If you have a good reason to do so anyway, use latin-1 as the decoder.
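The reason latin-1 works where utf-8 fails is that it maps every byte value 0 to 255 to the code point with the same number, so decoding never raises and encoding back recovers the original bytes. A quick check:

```python
blob = bytes(range(256))          # every possible byte value, including 0x8f

as_text = blob.decode("latin-1")  # never raises: each byte maps to one code point
restored = as_text.encode("latin-1")

print(len(as_text))
print(ord(as_text[0x8f]) == 0x8f)  # byte value and code point number coincide
print(restored == blob)            # the round trip is lossless
```

The resulting str is not meaningful text; it is only a lossless container for the bytes.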

Porting from Python 2 to Python 3: 'utf-8 codec can't decode byte'
