I am using urllib2
to open and save a webpage. However, often only part of the page is downloaded, while other times the whole page is downloaded.
import urllib2
import time
import numpy as np
from itertools import izip
outFiles = ["outFile1.html", "outFile2.html", "outFile3.html", "outFile4.html"]
urls=["http://www.guardian.co.uk/commentisfree/2011/sep/06/cameron-nhs-bill-parliament?commentpage=all",
"http://www.guardian.co.uk/commentisfree/2011/sep/06/tory-scotland-conservative-murdo-fraser?commentpage=all",
"http://www.guardian.co.uk/commentisfree/2011/sep/06/palestine-statehood-united-nations?commentpage=all",
"http://www.guardian.co.uk/commentisfree/2011/sep/05/in-praise-of-understanding-riots?commentpage=all"]
opener = urllib2.build_opener()
user_agent = 'Mozilla/5.0 (Ubuntu; X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'
opener.addheaders = [('User-agent', user_agent)]
urllib2.install_opener(opener)
for fileName, url in izip(outFiles, urls):
    response = urllib2.urlopen(url)
    responseBody = response.read()
    fp = open(fileName, 'w')
    fp.write(responseBody)
    fp.close()
    # Wait a random 20-40 seconds between requests
    time.sleep(np.random.randint(20, 40))
The size of the output files varies from run to run. Sometimes a file is over 200 KB, up to 1 MB; other times it is around 140 KB. What could cause this difference?
When a file is smaller, the comment section is missing, but the file is otherwise never incomplete. Sometimes the whole page, including the comments, is downloaded. I also tried curl
but had a similar problem. What I don't understand is what causes this inconsistency.
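One thing worth ruling out is a short read: a single `read()` on a socket-backed stream is not always guaranteed to return the whole body, and a server may also close the connection early. Below is a minimal sketch of a defensive read loop that collects chunks until EOF and compares the byte count against the server's declared Content-Length (the helper name `read_full_body` and the chunk size are my own, not from the original code):

```python
import io

def read_full_body(response, expected_length=None, chunk_size=8192):
    """Read a file-like HTTP response in chunks until EOF.

    Loops until the stream is exhausted, then optionally compares the
    total byte count against the Content-Length the server declared,
    raising IOError on a short read so the caller can retry.
    """
    chunks = []
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # empty read means EOF
            break
        chunks.append(chunk)
    body = b"".join(chunks)
    if expected_length is not None and len(body) != expected_length:
        raise IOError("short read: got %d of %d bytes"
                      % (len(body), expected_length))
    return body

# Simulated response: the loop collects every chunk until EOF.
fake = io.BytesIO(b"<html>" + b"x" * 20000 + b"</html>")
body = read_full_body(fake, expected_length=20013)
print(len(body))  # 20013
```

With a real urllib2 response you would pass `int(response.headers.get('Content-Length'))` as `expected_length` when the header is present, and retry the URL on an IOError. Note that if the page itself varies between requests (e.g. the comment section is rendered differently per request), the download can be "complete" yet still differ in size, so a matching Content-Length would point at server-side variation rather than a transfer problem.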