Web crawler does not print every image source

Posted: 2015-08-18 17:59:02

Tags: python html image beautifulsoup web-crawler

I'm trying to build a web scraper that gives me all the links to the images at a specified URL, but many of the images that I can find when viewing the page source (and searching through it with CTRL+F) are not printed in the output.

My code is:

import requests
from bs4 import BeautifulSoup
import urllib
import os

print ("Which website would you like to crawl?")
website_url = raw_input("--> ")

i = 0
while i < 1:
    source_code = requests.get(website_url)  # The source code will have the page source (<html>.......</html>)
    plain_text = source_code.text  # Gets only the text from the source code
    soup = BeautifulSoup(plain_text, "html5lib")
    for link in soup.findAll('img'):  # A loop which looks for all the images in the website
        src = link.get('src')  # I want to get the image URL, and it's located under 'src' in the HTML
        if 'http://' not in src and 'https://' not in src:
            if src[0] != '/':
                src = '/' + src
            src = website_url + src
        print src
    i += 1
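
(As an aside, I build absolute URLs by hand above; urljoin from the standard library would handle relative and protocol-relative paths more robustly. A minimal sketch, using the domain from my example plus a hypothetical CDN path just to show the joining behaviour:)

from urlparse import urljoin  # Python 2; in Python 3 this is urllib.parse.urljoin

base = "http://shippuden.co.il/"
# Relative path gets resolved against the base URL
print urljoin(base, "wp-content/uploads/newkadosh21.jpg")
# Protocol-relative path (hypothetical CDN) keeps its own host, inherits the scheme
print urljoin(base, "//cdn.example.com/pic.jpg")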

How can I print every image that appears in an <img> tag in the HTML page source?

For example, the website contains the following HTML code:

<img src="http://shippuden.co.il/wp-content/uploads/newkadosh21.jpg" *something* >

but the script does not print this src, even though it does print the src of other <img .... src="..."> tags.

How should I improve the code so that it finds all the images?

1 Answer:

Answer 0 (score: 0)

Looking at the main page of the domain you posted in your example, I see that the image you refer to is not in the src attribute but in the data-lazy-src attribute.

So you should parse both attributes:

src = link.get('src')
lazy_load_src = link.get('data-lazy-src')

In fact, when running the example code you showed, the img src of the image newkadosh21 is printed, but it is a base64 data URI, like:

src="data:image/gif;base64,R0lGODdhAQABAPAAAP///wAAACwAAAAAAQABAEACAkQBADs="