For learning purposes, I am trying to download all of the post images from a Buzzfeed article.
Here is my code:
import lxml.html
import string
import random
import requests

url = 'http://www.buzzfeed.com/mjs538/messages-from-creationists-to-people-who-believe-in-evolutio?bftw'
headers = {
    'User-Agent': 'Mozilla/5.0',
    'From': 'admin@jhvisser.com'
}

page = requests.get(url)
tree = lxml.html.fromstring(page.content)
images = tree.cssselect("div.sub_buzz_content img")

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for x in range(size))

for image in images:
    with open(id_generator() + '.jpg', 'wb') as handle:
        request = requests.get(image.attrib['src'], headers=headers, stream=True)
        for block in request.iter_content(1024):
            if not block:
                break
            handle.write(block)
Every image retrieved is 110 bytes in size, and when I view them they are just blank images. Is there something wrong in my code that is causing this? Also, I don't have to use requests if there is an easier way.
Answer (score: 1)
If you take a closer look at the source code of the page you are trying to scrape, you will see that the image URLs you want are not specified in the src attribute of the img tags, but in the rel:bf_image_src attribute. Changing image.attrib['src'] to image.attrib['rel:bf_image_src'] will fix your problem.
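
For illustration, here is a minimal sketch of that one-line fix applied to your requests/lxml script. It is untested on my end (I couldn't get cssselect to run, see below), so treat it as illustrative rather than verified:

import lxml.html
import string
import random
import requests

url = 'http://www.buzzfeed.com/mjs538/messages-from-creationists-to-people-who-believe-in-evolutio?bftw'
headers = {
    'User-Agent': 'Mozilla/5.0',
    'From': 'admin@jhvisser.com'
}

page = requests.get(url, headers=headers)
tree = lxml.html.fromstring(page.content)
images = tree.cssselect("div.sub_buzz_content img")

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for x in range(size))

for image in images:
    # The real image URL lives in rel:bf_image_src, not src.
    img_url = image.attrib['rel:bf_image_src']
    with open(id_generator() + '.jpg', 'wb') as handle:
        response = requests.get(img_url, headers=headers, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)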
I wasn't able to run your code as-is (it complained that cssselect is not installed), but this code, using BeautifulSoup and urllib2, runs smoothly on my machine and downloads all 22 photos.
from itertools import count
from bs4 import BeautifulSoup
import urllib2
from time import sleep

url = 'http://www.buzzfeed.com/mjs538/messages-from-creationists-to-people-who-believe-in-evolutio?bftw'
headers = {
    'User-Agent': 'Non-commercical crawler, Steinar Lima. Contact: https://stackoverflow.com/questions/21616904/images-downloaded-are-blank-images-instead-of-actual-images'
}

r = urllib2.Request(url, headers=headers)
soup = BeautifulSoup(urllib2.urlopen(r))
c = count()

for div in soup.find_all('div', id='buzz_sub_buzz'):
    for img in div.find_all('img'):
        print img['rel:bf_image_src']
        with open('images/{}.jpg'.format(next(c)), 'wb') as img_out:
            req = urllib2.Request(img['rel:bf_image_src'], headers=headers)
            img_out.write(urllib2.urlopen(req).read())
        sleep(5)
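
If you are on Python 3, a rough port of the above would look like this. This is a sketch that assumes the page markup is unchanged; the main differences are that urllib2 was split into urllib.request and print became a function:

from itertools import count
from time import sleep
import os
import urllib.request

from bs4 import BeautifulSoup

url = 'http://www.buzzfeed.com/mjs538/messages-from-creationists-to-people-who-believe-in-evolutio?bftw'
headers = {'User-Agent': 'Mozilla/5.0'}

os.makedirs('images', exist_ok=True)  # the Python 2 snippet assumes this directory exists

req = urllib.request.Request(url, headers=headers)
soup = BeautifulSoup(urllib.request.urlopen(req), 'html.parser')
c = count()

for div in soup.find_all('div', id='buzz_sub_buzz'):
    for img in div.find_all('img'):
        print(img['rel:bf_image_src'])
        img_req = urllib.request.Request(img['rel:bf_image_src'], headers=headers)
        with open('images/{}.jpg'.format(next(c)), 'wb') as img_out:
            img_out.write(urllib.request.urlopen(img_req).read())
        sleep(5)  # be polite: pause between downloads

The sleep(5) between downloads throttles the crawler so it doesn't hammer the server, and the descriptive User-Agent with contact details makes it easy for the site operator to reach you if the crawler misbehaves.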