Question

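Previously, in Python 2.6, I made extensive use of urllib.urlopen to fetch web page content and then post-process the data I received. Now those routines, as well as the new ones I am trying to write for Python 3.2, are running into what appears to be a Windows-only problem (possibly even a Windows 7-only problem).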

Using the following code with Python 3.2.2 (64-bit) on Windows 7...

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

string = fp.read()
fp.close()
print(string.decode("utf8"))

I get the following message:

Traceback (most recent call last):
  File "TATest.py", line 5, in <module>
    string = fp.read()
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 553, in _read_chunked
    self._safe_read(2)      # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

Using the following code...

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)
for Line in fp:
    print(Line.decode("utf8").rstrip('\n'))
fp.close()

I get quite a bit of the web page content, but then the rest of the read is cut short by...

Traceback (most recent call last):
  File "TATest.py", line 9, in <module>
    for Line in fp:
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 545, in _read_chunked
    self._safe_read(2)  # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

Trying to read a different page produces...

Traceback (most recent call last):
  File "TATest.py", line 11, in <module>
    print(Line.decode("utf8").rstrip('\n'))
  File "d:\python32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position
21: character maps to <undefined>

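I do believe this is a Windows problem, but can Python be made more robust against whatever is causing it? When we tried similar code (the 2.6 version) on Linux, we had no problems. Is there a workaround? I have also posted this to the gmane.comp.python.devel newsgroup.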

Answer 1

The page you are reading looks like it is encoded in cp1252.

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

string = fp.read()
fp.close()
print(string.decode("cp1252"))

should work.
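As a rough illustration of why the codec matters, here is a minimal sketch. The sample bytes are made up (the real page bytes are not shown in the question), and `console_safe` is a hypothetical helper, not part of the standard library. Note also that the last traceback in the question passes through the `encode` function of cp1252.py, which suggests the failure may occur when `print` writes to a cp1252 console rather than during decoding; the helper shows one way to guard that step.

```python
# Hypothetical sample bytes, not taken from the real page.
raw = b"it\x92s"

# In cp1252 the byte 0x92 is the right single quotation mark (U+2019).
text = raw.decode("cp1252")
print(ascii(text))


def console_safe(text, console_encoding="cp1252"):
    """Replace characters the console codec cannot encode with '?'.

    console_encoding is an assumption; on a real system it could be
    taken from sys.stdout.encoding instead.
    """
    encoded = text.encode(console_encoding, errors="replace")
    return encoded.decode(console_encoding)


# U+0092 is the character named in the question's UnicodeEncodeError.
print(console_safe("it\x92s"))
```

Passing text through a helper like this before printing trades exactness for robustness: unencodable characters are lost, but the script keeps running.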

There are many ways to specify the charset of the content, but using the HTTP header should be sufficient for most pages:

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

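# get_content_charset() returns None when the header names no charset;
# falling back to UTF-8 here is an assumption, adjust it for your pages.
string = fp.read().decode(fp.info().get_content_charset() or "utf8")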
fp.close()
print(string)
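Neither snippet above addresses the IncompleteRead tracebacks from the question. One possible workaround, sketched here under assumptions, is to read line by line and keep whatever arrived before the server cut the chunked stream short, since the question reports that iteration yields most of the page before failing. The names `collect_lines`, `decode_body`, and `truncated_stream` are hypothetical, and the generator only simulates a response that ends early; with a real response you would pass the object returned by `urllib.request.urlopen`.

```python
import http.client


def collect_lines(response):
    """Gather lines from a response, tolerating a truncated chunked body.

    Returns (body_bytes, complete); complete is False if the stream
    ended with http.client.IncompleteRead.
    """
    lines = []
    complete = True
    try:
        for line in response:
            lines.append(line)
    except http.client.IncompleteRead:
        complete = False
    return b"".join(lines), complete


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode with the declared charset if any, else the fallback.

    errors="replace" keeps undecodable bytes from raising.
    """
    return raw.decode(charset or fallback, errors="replace")


def truncated_stream():
    """Simulated response: two lines, then the stream ends early."""
    yield b"<html>\n"
    yield b"<body>partial\n"
    raise http.client.IncompleteRead(b"", 2)


body, complete = collect_lines(truncated_stream())
print(decode_body(body), end="")
print("complete:", complete)
```

Whether the data kept this way is usable depends on the page and on where the server stopped sending; the `complete` flag is there so the caller can decide whether to retry or accept the partial result.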
