Limit on the number of images scraped from a website?

Asked: 2018-08-09 07:07:54

Tags: python-3.x beautifulsoup

I am scraping images from a web page in Python 3 using Selenium and BeautifulSoup. I use Selenium because the site content is generated dynamically and requires a login. Everything works as planned, except that I only download 101 of the 300 images. Here is the relevant code:

import requests
from bs4 import BeautifulSoup

source = driver.page_source.encode('utf-8')

soup = BeautifulSoup(source, 'lxml')
avatar_images = soup.find_all('img', 'avatar__image')
print(len(avatar_images))  # 101

urls = [img['src'] for img in avatar_images]

for index, url in enumerate(urls):
    try:
        response = requests.get(url)
        response.raise_for_status()  # raise HTTPError on 4xx/5xx responses
        with open("img_" + str(index) + ".jpg", 'wb') as f:
            f.write(response.content)
        print("Downloading '%s'" % url)
    except requests.exceptions.HTTPError as e:
        print("%s '%s'" % (e, url))

I would like to know why the avatar_images list collects only the first 101 matching images when all 300 images have the 'avatar__image' class. Could it be because the image URLs are very long? For example, here are a few lines of output from the print statement above:

Downloading 'https://pingboard-production.s3.amazonaws.com/user/avatars/GqcEiAOQ6SiHpLBrpAit_8a8739f9793af0363c2b16cb92c7b75135d812cf20c764d2cd23fd1adf2ee493.jpg'
Downloading 'https://pingboard-production.s3.amazonaws.com/user/avatars/2PA7hBWiSNSNsrkpHave_8a8739f9793af0363c2b16cb92c7b75135d812cf20c764d2cd23fd1adf2ee493.jpg'
Downloading 'https://cdn.filestackcontent.com/AchUBPpbtR12UdA8r3ilwz/security=policy:eyJleHBpcnkiOjIxNTAxMzIzMDYsImNhbGwiOlsicmVhZCIsImNvbnZlcnQiXSwiaGFuZGxlIjoiR3hIR2lGN1FJS3ZIYzFSS1Q0dHcifQ==,signature:83447afe4180229b9d3c8e24cd22a1b99476134bdc4ae6429971a4115e0e8616/resize=w:300,h:300,fit:crop,align:faces/rotate=d:exif/GxHGiF7QIKvHc1RKT4tw'
Downloading 'https://cdn.filestackcontent.com/AchUBPpbtR12UdA8r3ilwz/security=policy:eyJleHBpcnkiOjIxNTk0ODQ3NzAsImNhbGwiOlsicmVhZCIsImNvbnZlcnQiXSwiaGFuZGxlIjoicjhLaW5tbnRUOEI1bUwwY1VNancifQ==,signature:62995975dd025983b87cea7530b145f1fcbb89cb33376512381dee49de603fdb/resize=w:300,h:300,fit:crop,align:faces/rotate=d:exif/r8KinmntT8B5mL0cUMjw'

All of these first 101 images download successfully, in order, but nothing beyond that. Could I be hitting some character limit in the avatar_images list?

All 300 images have a similar format, and each loads via its own URL. The last image that downloads comes from this element:

<img class="avatar__image" width="300" height="300" data-bind="attr: { src: avatarUrl, alt: name }" src="https://cdn.filestackcontent.com/AchUBPpbtR12UdA8r3ilwz/security=policy:eyJleHBpcnkiOjIxNjEwMzc2NDgsImNhbGwiOlsicmVhZCIsImNvbnZlcnQiXSwiaGFuZGxlIjoiWlcxcGRhMW5TV0NKZ0dkSnlUaU0ifQ==,signature:8e61a789abcef404ce614e4d4e4ff583dd5bf24bde78dc616fd5c73e85fdbdf7/resize=w:300,h:300,fit:crop,align:faces/rotate=d:exif/ZW1pda1nSWCJgGdJyTiM" alt="">

The first one that fails to download has this element:

<img class="avatar__image" width="300" height="300" data-bind="attr: { src: avatarUrl, alt: name }" src="https://pingboard-production.s3.amazonaws.com/user/avatars/UM4IhNRzeanP6DH8rS8w_8a8739f9793af0363c2b16cb92c7b75135d812cf20c764d2cd23fd1adf2ee493.jpg" alt="">

EDIT: I inspected the value of soup and found that it contains only 101 images. The dynamically populated site must cap the number of image elements at that count and fill in the rest as you scroll.
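A quick way to confirm this from the same driver session (a sketch; the pause length is a guess):

import time

before = len(driver.find_elements_by_class_name('avatar__image'))
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
time.sleep(2)  # allow the page a moment to append more elements
after = len(driver.find_elements_by_class_name('avatar__image'))
print(before, after)  # 101 before; larger after, if images load on scroll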

1 answer:

Answer 0 (score: 0)

The web page probably uses something like https://infinite-scroll.com/. To scroll down to the bottom of the content with Selenium, you can use:

driver.execute_script('window.scrollTo(0,document.body.scrollHeight);')

Then you need to wait for the page to load. Setting the window size to a large value may also help, since the page may then load more items at once. Depending on the web page, you may need to scroll several times, as in the sketch below.
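A sketch of such a loop (the two-second pause is an assumption to tune per site): keep scrolling until document.body.scrollHeight stops growing.

import time

last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
    time.sleep(2)  # give the page time to append new items; tune per site
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break  # height stopped growing; assume we reached the bottom
    last_height = new_height

Below is a minimal end-to-end version with a single scroll: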

from bs4 import BeautifulSoup
from selenium import webdriver  
import time

driver = webdriver.Chrome()
driver.set_window_size(1680, 1050)
url = ("https://www.google.co.uk/search?q=test&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjPiZCr6N_cAhVCJBoKHdzhDBwQ_AUICygC&biw=1680&bih=944")
driver.get(url)
driver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
# Wait for page to load.
time.sleep(5)
source = driver.page_source.encode('utf-8')
...
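Instead of the fixed time.sleep(5), an explicit wait tends to be more reliable. A sketch for the asker's page, reusing the 'avatar__image' class from the question (the 15-second timeout is a guess):

from selenium.webdriver.support.ui import WebDriverWait

# Block until more than the initial 101 avatar images are present,
# or raise TimeoutException after 15 seconds.
WebDriverWait(driver, 15).until(
    lambda d: len(d.find_elements_by_class_name('avatar__image')) > 101
)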