urllib2 gives inconsistent output - often downloads only part of a web page

Asked: 2012-02-02 20:11:18

Tags: python curl download urllib2 urllib

I am using urllib2 to open and save web pages. However, often only part of a page is downloaded, while at other times the whole page is downloaded.

import urllib2
import time
import numpy as np
from itertools import izip

outFiles = ["outFile1.html", "outFile2.html", "outFile3.html", "outFile4.html"]

urls=["http://www.guardian.co.uk/commentisfree/2011/sep/06/cameron-nhs-bill-parliament?commentpage=all",
"http://www.guardian.co.uk/commentisfree/2011/sep/06/tory-scotland-conservative-murdo-fraser?commentpage=all",
"http://www.guardian.co.uk/commentisfree/2011/sep/06/palestine-statehood-united-nations?commentpage=all",
"http://www.guardian.co.uk/commentisfree/2011/sep/05/in-praise-of-understanding-riots?commentpage=all"]

# Install an opener that sends a browser-style User-Agent header.
opener = urllib2.build_opener()
user_agent = 'Mozilla/5.0 (Ubuntu; X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'
opener.addheaders = [('User-agent', user_agent)]
urllib2.install_opener(opener)

for fileName, url in izip(outFiles, urls):
    response = urllib2.urlopen(url)
    responseBody = response.read()
    # Write the raw response bytes; 'wb' avoids newline translation.
    fp = open(fileName, 'wb')
    fp.write(responseBody)
    fp.close()
    # Wait a random 20-40 seconds between requests.
    time.sleep(np.random.randint(20, 40))

Depending on the run, the size of the output files varies. Sometimes the files are over 200 KB, up to 1 MB, while at other times they are around 140 KB. What could cause this difference?

When the files are smaller, the comments section is missing, but the file never looks incomplete. Sometimes the whole page, including the comments, is downloaded. I have looked at curl as well, but it has a similar problem. What I don't understand is what causes this inconsistency.
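One way to narrow a problem like this down (a sketch, not part of the original question): read the response in a loop until EOF instead of relying on a single `read()`, and then compare the byte count against the `Content-Length` header when the server sends one. The `read_all` helper below is a hypothetical name, and the in-memory `BytesIO` stream only simulates a response body so the snippet runs without network access; with a real `urllib2` response you would pass the object returned by `urlopen` and compare `len(data)` with `response.info().getheader('Content-Length')`, bearing in mind that chunked responses may omit that header.

```python
import io


def read_all(response, chunk_size=8192):
    """Read a file-like response in chunks until EOF and return all bytes."""
    parts = []
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # empty read means the stream is exhausted
            break
        parts.append(chunk)
    return b"".join(parts)


# Simulate a response body with an in-memory stream (no network needed).
body = b"<html>" + b"x" * 100000 + b"</html>"
data = read_all(io.BytesIO(body))
assert len(data) == len(body)
```

If the loop consistently returns fewer bytes than `Content-Length` reports, the connection is being closed early; if the counts match but page sizes still vary between runs, the server itself is returning different content (for example, truncating the comment section), which no client-side change would fix.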

0 Answers:

No answers