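Previously, in Python 2.6, I made heavy use of urllib.urlopen to fetch web page content and then post-process the data I received. Now those routines, as well as the new ones I am writing for Python 3.2, are running into what seems to be a Windows-only problem (possibly even a Windows 7-only problem).
Using the following code with Python 3.2.2 (64-bit) on Windows 7...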
import urllib.request
fp = urllib.request.urlopen(URL_string_that_I_use)
string = fp.read()
fp.close()
print(string.decode("utf8"))
I get the following message:
Traceback (most recent call last):
  File "TATest.py", line 5, in <module>
    string = fp.read()
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 553, in _read_chunked
    self._safe_read(2) # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)
Using the following code instead...
import urllib.request
fp = urllib.request.urlopen(URL_string_that_I_use)
for Line in fp:
    print(Line.decode("utf8").rstrip('\n'))
fp.close()
I get a fair amount of the page content, but the rest is cut off by...
Traceback (most recent call last):
  File "TATest.py", line 9, in <module>
    for Line in fp:
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 545, in _read_chunked
    self._safe_read(2) # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)
Trying to read a different page produces...
Traceback (most recent call last):
  File "TATest.py", line 11, in <module>
    print(Line.decode("utf8").rstrip('\n'))
  File "d:\python32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 21: character maps to <undefined>
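I do believe this is a Windows problem, but can Python be made more robust against whatever is causing it? When we try similar code (the version 2.6 code) on Linux, we do not run into the problem. Is there a workaround? I have also posted this to the gmane.comp.python.devel newsgroup.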
Answer 0 (score: 2)
The page you are reading looks like it is encoded in cp1252.
import urllib.request
fp = urllib.request.urlopen(URL_string_that_I_use)
string = fp.read()
fp.close()
print(string.decode("cp1252"))
should work.
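Note that the last traceback in the question is raised inside cp1252.py's encode, which suggests the failure happens when print writes to a cp1252 console rather than when the bytes are decoded. As a minimal sketch of one way to guard against that (console_safe is a hypothetical helper name, not part of urllib or the standard library), characters the console cannot represent can be escaped before printing:

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent escaped.

    console_safe is a hypothetical helper for illustration only.
    """
    # Use the console's encoding when none is given; fall back to ASCII
    # if stdout does not report one.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # backslashreplace turns unencodable characters into \xNN style escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# The character from the traceback no longer aborts the print call.
print(console_safe("it\x92s"))
```

This trades fidelity for robustness: the output shows an escape sequence in place of the offending character, but the rest of the page is still printed.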
There are many ways to specify the character set of a document, but relying on the HTTP header should be enough for most pages:
import urllib.request
fp = urllib.request.urlopen(URL_string_that_I_use)
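# get_content_charset() returns None when the header declares no charset, so fall back to UTF-8
string = fp.read().decode(fp.info().get_content_charset() or "utf-8")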
fp.close()
print(string)
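The header-based approach can be made a little more defensive. The following is only a sketch under stated assumptions: decode_body and fetch_text are hypothetical helper names, UTF-8 is an assumed fallback, and keeping the partial body after an IncompleteRead is a workaround that may return a truncated page rather than a fix for the underlying chunked-transfer problem.

```python
import http.client
import urllib.request


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode bytes with the declared charset, falling back when it is
    missing or unknown. Undecodable bytes become U+FFFD instead of raising.
    """
    try:
        return raw.decode(charset or fallback, errors="replace")
    except LookupError:
        # The server declared a charset name Python does not recognise.
        return raw.decode(fallback, errors="replace")


def fetch_text(url):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url) as fp:
        # Charset from the Content-Type header, or None if absent.
        charset = fp.info().get_content_charset()
        try:
            raw = fp.read()
        except http.client.IncompleteRead as exc:
            # Keep whatever arrived before the chunked response broke off.
            raw = exc.partial
    return decode_body(raw, charset)
```

Calling fetch_text(URL_string_that_I_use) would then replace the read/decode/close sequence above, with the caveat that a page recovered from exc.partial may be incomplete.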