I have a script that downloads the Questionable Content webcomic. It appears to run fine, but the files it downloads are empty, only a few KB each.
#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os
#RegExp for the EndNum variable
RegExp = re.compile('.*<img src="http://www.questionablecontent.net/comics.*')
#Check the main QC page
site = urllib.urlopen("http://questionablecontent.net/")
contentLine = None
#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if RegExp.search(line):
        contentLine = line
        break
#IF the information was found successfully, automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine:
    contentLine = contentLine.split('/')
    contentLine = contentLine[4].split('.')
    EndNum = int(contentLine[0])
else:
    EndNum = 2622
#First and Last comics user wishes to download
StartNum = 1
#EndNum = 2622
#Full path of destination folder needs to pre-exist
destinationFolder = "D:\Downloads\Comics\Questionable Content"
#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):
    #IF you already have the comic, skip downloading it
    if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue
    #Printing User-Friendly Messages
    print "Comic %d Found. Downloading..." % i
    source = "http://www.questionablecontent.net/comics/"+str(i)+".png"
    #Save image from QC to Destination Folder as a PNG (As most comics are PNGs)
    urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))
#Graceful program termination
print str(EndNum-StartNum) + " Comics Downloaded"
Why does it keep downloading empty files? Is there a fix?
Answer 0 (score: 0)
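The problem here is that the server will not serve you the image unless a user agent is set. Below is example code for Python 2.7 that should give you an idea of how to get your script working.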
import urllib2
import time
first = 1
last = 2622
for i in range(first, last+1):
    time.sleep(5) # Be nice to the server! And avoid being blocked.
    for ext in ['png', 'gif']:
        try:
            req = urllib2.Request('http://www.questionablecontent.net/comics/{}.{}'.format(i, ext))
            req.add_header('user-agent', 'Mozilla/5.0')
            data = urllib2.urlopen(req).read()
        except urllib2.HTTPError:
            # This extension does not exist for comic i; try the next one.
            continue
        # Make sure that the img dir exists! If not, the script will throw an
        # IOError. The file is opened only after the request succeeded, so a
        # failed extension does not leave an empty file behind.
        with open('img/{}.{}'.format(i, ext), 'wb') as ifile:
            ifile.write(data)
        break
    else:
        print 'Could not find image {}'.format(i)
        continue
    print 'Downloaded image {}'.format(i)
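The question describes files of only a few KB, which suggests the server returned something other than the comic. The thread does not say what that response contains, so one hedged way to check is to look at the leading bytes of a saved file and compare them with the standard PNG and GIF signatures. The helper names `sniff_image_type` and `looks_like_image` below are made up for illustration; this is a minimal sketch, not part of the original answer.

```python
# Standard file signatures ("magic bytes") for the two formats the comics use.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
GIF_SIGNATURES = (b'GIF87a', b'GIF89a')


def sniff_image_type(data):
    """Return 'png', 'gif', or None based on the leading bytes of data."""
    if data.startswith(PNG_SIGNATURE):
        return 'png'
    if data.startswith(GIF_SIGNATURES):
        return 'gif'
    return None


def looks_like_image(path):
    """Read the first 8 bytes of the file at path and check for a PNG/GIF signature."""
    with open(path, 'rb') as f:
        return sniff_image_type(f.read(8)) is not None
```

Running `looks_like_image` on one of the small files from the failed run should return `False` if the server sent something other than image data.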
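You may want to change the loop to something closer to your own (checking whether an image has already been downloaded, etc.). The download script tries to fetch every image from <start>.<ext> to <end>.<ext>, where <ext> is either gif or png.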
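The adjustment suggested above (skipping comics that were already downloaded) could be sketched as follows. This is an illustration under assumptions, not the answerer's code: it assumes Python 3 and `urllib.request` instead of the Python 2.7 `urllib2` used in the thread, and the names `fetch`, `already_downloaded`, and `download_range` are hypothetical. The `fetch` parameter exists only so the download step can be swapped out, for example when trying the loop without touching the network.

```python
import os
import time
import urllib.error
import urllib.request

# URL pattern and extensions taken from the scripts in this thread.
BASE_URL = 'http://www.questionablecontent.net/comics/{}.{}'
EXTENSIONS = ('png', 'gif')


def fetch(url):
    """Return the response body for url, sending an explicit User-Agent header."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def already_downloaded(dest, number):
    """Return True if a non-empty file for this comic exists under any known extension."""
    for ext in EXTENSIONS:
        path = os.path.join(dest, '{}.{}'.format(number, ext))
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return True
    return False


def download_range(first, last, dest, fetch=fetch, delay=5):
    """Download comics first..last into dest; return the numbers actually saved."""
    os.makedirs(dest, exist_ok=True)
    saved = []
    for number in range(first, last + 1):
        if already_downloaded(dest, number):
            continue
        for ext in EXTENSIONS:
            try:
                data = fetch(BASE_URL.format(number, ext))
            except urllib.error.HTTPError:
                continue  # this extension is not available; try the next one
            with open(os.path.join(dest, '{}.{}'.format(number, ext)), 'wb') as f:
                f.write(data)
            saved.append(number)
            break
        time.sleep(delay)  # be nice to the server
    return saved
```

A call such as `download_range(1, 10, 'img')` would fetch the first ten comics into `img`. Note that the skip check only tests for a non-empty file, so the few-KB files left over from the earlier failed run would be treated as already downloaded and should be deleted first.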