Problem with the routines supplied with Python 2.6 and 3.2 on Windows

Time: 2011-11-15 00:02:12

Tags: python urllib python-2.6 python-3.2

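Previously, in Python 2.6, I made heavy use of urllib.urlopen to capture web page content and then post-process the data I received. Now those routines, as well as the new ones I am trying to use with Python 3.2, are running into what appears to be a Windows-only problem (possibly even a Windows 7-only problem).

Using the following code with Python 3.2.2 (64-bit) on Windows 7...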

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

string = fp.read()
fp.close()
print(string.decode("utf8"))

I receive the following message:

Traceback (most recent call last):
  File "TATest.py", line 5, in <module>
    string = fp.read()
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 553, in _read_chunked
    self._safe_read(2)      # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

Using the following code...

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)
for Line in fp:
    print(Line.decode("utf8").rstrip('\n'))
fp.close()

I get a fair amount of the web page content, but then the rest is foiled by...

Traceback (most recent call last):
  File "TATest.py", line 9, in <module>
    for Line in fp:
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 545, in _read_chunked
    self._safe_read(2)  # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

Trying to read a different page yields...

Traceback (most recent call last):
  File "TATest.py", line 11, in <module>
    print(Line.decode("utf8").rstrip('\n'))
  File "d:\python32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position
21: character maps to <undefined>

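I do believe this is a Windows problem, but can Python be made more robust against whatever is causing it? When trying similar code on Linux (the version 2.6 code), we do not run into the problem. Is there a workaround? I have also posted to the gmane.comp.python.devel newsgroup.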
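
For the IncompleteRead failures shown above, one possible workaround is to read the response in pieces and keep whatever arrived before the error. The following is a minimal sketch, not a fix for the underlying cause: `read_tolerant` and `chunk_size` are hypothetical names, and the result may be silently truncated. On the Python 3.2 build in the tracebacks the exception reports 0 bytes read, so `partial` may well be empty there.

```python
import http.client


def read_tolerant(fp, chunk_size=8192):
    """Read a response piece by piece, keeping what arrived before a failure.

    Hypothetical helper for illustration; fp is any object with read(n),
    such as the value returned by urllib.request.urlopen().
    """
    chunks = []
    try:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break  # normal end of the body
            chunks.append(chunk)
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes attached to the exception (may be empty).
        chunks.append(exc.partial)
    return b"".join(chunks)
```

It would be called as `data = read_tolerant(urllib.request.urlopen(URL_string_that_I_use))`. Since the body may be incomplete, the caller should treat the result with care.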

1 Answer:

Answer 0 (score: 2):

The page you are reading looks like it is encoded in cp1252.

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

string = fp.read()
fp.close()
print(string.decode("cp1252"))

should work.
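
Note that the UnicodeEncodeError traceback in the question is raised in the `encode` step of cp1252.py, called from `print`, so the console's output encoding may also be involved. As a minimal sketch, assuming the goal is only to print without crashing, characters the console codec cannot represent can be replaced before printing. `console_safe` is a hypothetical helper name, not a standard library function.

```python
import sys


def console_safe(text, encoding=None):
    """Replace characters the given (or console) codec cannot encode with '?'.

    Hypothetical helper for illustration only.
    """
    # Use the console's encoding when none is given; fall back to UTF-8
    # if stdout does not report one.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)
```

For example, `print(console_safe(string.decode("utf8")))` would print a `?` in place of any character the console cannot display.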

There are many ways to specify the character set of the content, but using the HTTP headers should be enough for most pages:

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

string = fp.read().decode(fp.info().get_content_charset())
fp.close()
print(string)
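
One caveat with the header-based approach: `get_content_charset()` returns None when the Content-Type header carries no charset, and `decode(None)` then fails. A minimal sketch of a fallback follows, assuming a default codec is acceptable. `decode_response` and its `fallback` parameter are hypothetical names, and the choice of default is an assumption, not something the server guarantees.

```python
def decode_response(raw, declared_charset, fallback="utf-8"):
    """Decode raw bytes with the declared charset, else with a fallback codec.

    Hypothetical helper for illustration; undecodable bytes are replaced
    rather than raising.
    """
    try:
        return raw.decode(declared_charset or fallback, errors="replace")
    except LookupError:
        # The server declared a codec name that Python does not know.
        return raw.decode(fallback, errors="replace")
```

With the code above it could be used as `string = decode_response(fp.read(), fp.info().get_content_charset())`, passing `fallback="cp1252"` for pages that are expected to use that encoding.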