I'm trying to build a web scraper that prints every image link found at a given URL, but many images that I can see in the page source (verified by searching it with Ctrl+F) are missing from the output.
My code is:
import requests
from bs4 import BeautifulSoup
import urllib
import os

print ("Which website would you like to crawl?")
website_url = raw_input("--> ")
i = 0
while i < 1:
    source_code = requests.get(website_url)  # The response holds the page source (<html>...</html>)
    plain_text = source_code.text  # Gets only the text from the response
    soup = BeautifulSoup(plain_text, "html5lib")
    for link in soup.findAll('img'):  # Loop over all the images in the page
        src = link.get('src')  # The image URL is stored in the 'src' attribute
        if 'http://' not in src and 'https://' not in src:
            if src[0] != '/':
                src = '/' + src
            src = website_url + src
        print src
    i += 1
How can I print the src of every <img> tag in the HTML page source?
For example, the website contains this HTML:
<img src="http://shippuden.co.il/wp-content/uploads/newkadosh21.jpg" *something* >
but the script does not print its src. The script only prints the src of tags of the form <img ... src="...">.
How should I improve the code so it finds all of the images?
Answer 0 (score: 0)
Looking at the main page of the domain you posted in your example, I see that the images you refer to are not in the src attribute but in the data-lazy-src attribute.
So you should parse both attributes:
src = link.get('src')
lazy_load_src = link.get('data-lazy-src')
In fact, when I run the sample code you showed, the img src for the image newkadosh21 is printed, but it is a base64 data URI, e.g.:
src="data:image/gif;base64,R0lGODdhAQABAPAAAP///wAAACwAAAAAAQABAEACAkQBADs="
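Putting the pieces together, here is a minimal sketch of the loop with both attributes handled. It is an assumption-laden illustration, not the asker's exact script: it uses the stdlib `urljoin` to resolve relative paths (instead of manual string concatenation), prefers `data-lazy-src` over `src`, and skips the inline `data:` base64 placeholders shown above. The function name `collect_image_urls` and the use of the built-in `"html.parser"` backend (rather than `html5lib`) are my choices for a self-contained example.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def collect_image_urls(page_url, html):
    """Return absolute image URLs, checking data-lazy-src before src."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img"):
        # Lazy-loaded images keep the real URL in data-lazy-src,
        # while src holds only a base64 placeholder.
        for attr in ("data-lazy-src", "src"):
            src = img.get(attr)
            if not src or src.startswith("data:"):
                continue  # missing attribute or inline base64 placeholder
            urls.append(urljoin(page_url, src))  # resolves relative paths
            break
    return urls
```

To fetch the page itself you would still use `requests.get(website_url).text` as in the original script, then pass that text to this function.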