Selenium scrolling and BeautifulSoup loop

Asked: 2018-05-30 02:30:39

Tags: python selenium beautifulsoup

I have this script that downloads images from Instagram. My only problem is that once Selenium starts scrolling down to the bottom of the page, BeautifulSoup begins scraping the same img src links again on each pass of the loop. It will keep scrolling and downloading, but by the time everything finishes I end up with 2 or 3 copies of each image. So my question is: is there a way to prevent this?

import time

import requests
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver


url = 'https://www.instagram.com/kitties'
driver = webdriver.Firefox()
driver.get(url)

scroll_delay = 0.5
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    # saves the image at the module-level img_url under a counter-based filename
    with open("/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

while True:

    # scroll to the bottom and wait for the next batch of posts to load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    # re-parses the full page source on every pass, so img tags that were
    # already downloaded get scraped again
    soup = BeautifulSoup(driver.page_source, 'lxml')
    imgs = soup.find_all('img', class_='_2di5p')
    for img in imgs:
        img_url = img["src"]
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1

    # stop once scrolling no longer increases the page height
    if new_height == last_height:
        break
    last_height = new_height

Update: I moved this part of the code outside of the while True loop, so that Selenium first loads the whole page and bs4 then, hopefully, scrapes all the images. It only gets up to number 30 and then stops.

soup = BeautifulSoup(driver.page_source, 'lxml')
imgs = soup.find_all('img', class_='_2di5p')
for img in imgs:
    #tn = datetime.now().strftime('%H:%M:%S')
    img_url = img["src"]
    print('=> [+] img_{}'.format(counter))
    screens(counter)
    counter = counter + 1

2 Answers:

Answer 0 (score: 1)

The reason your second version of the script only loads 30 is that the remaining elements are removed from the page DOM and are no longer part of the source that BeautifulSoup sees. The solution is to keep doing what you did the first time, but remove any duplicate elements before iterating over the list and calling screens(). You can do this using sets, as below, though I'm not sure whether it's the absolute most efficient way:

import requests
import selenium.webdriver as webdriver
import time

driver = webdriver.Firefox()

url = 'https://www.instagram.com/cats/?hl=en'
driver.get(url)

scroll_delay = 3
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    with open("test_images/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

old_imgs = set()

while True:

    imgs = driver.find_elements_by_class_name('_2di5p')

    # keep only the elements that were not seen on a previous pass
    imgs_dedupe = set(imgs) - set(old_imgs)

    for img in imgs_dedupe:
        img_url = img.get_attribute("src")
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    old_imgs = imgs

    if new_height == last_height:
        break
    last_height = new_height

driver.quit()

As you can see, I tested it with a different page, one with 420 images of cats. The result was 420 images, the number of posts on that account, with no duplicates among them.
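If deduplicating on the WebElement objects ever proves unreliable (the driver may hand back fresh wrappers for the same nodes after a scroll), the same loop can dedupe on the src URLs instead. A minimal sketch, reusing the driver, screens(), counter, scroll_delay and last_height names from the code above:

seen_urls = set()

while True:

    for img in driver.find_elements_by_class_name('_2di5p'):
        img_url = img.get_attribute("src")
        if img_url in seen_urls:
            continue  # already downloaded on an earlier pass
        seen_urls.add(img_url)
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break
    last_height = new_height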

Answer 1 (score: -1)

I would use the os library to check whether the file already exists:

import os


def screens(get_name):
    path = "/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name)
    # os.path.isfile() checks a file exists (gives False on a directory);
    # os.path.exists() would match a file or a directory
    if os.path.isfile(path):
        return
    r = requests.get(img_url)
    with open(path, 'wb') as f:
        f.write(r.content)

*Note: the existence check has to run before open(), since open(path, 'wb') truncates any existing file.
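One caveat with counter-based filenames: every call gets a fresh counter value and therefore a fresh path, so the existence check never fires for a re-scraped URL. A hypothetical variant (screens_by_url is not from either answer) names the file after a hash of the URL, so a repeated URL maps to the same path and is skipped:

import hashlib
import os

import requests


def screens_by_url(img_url):
    # name the file after a hash of the URL so duplicate URLs
    # collide on the same path and get skipped
    name = hashlib.md5(img_url.encode()).hexdigest()
    path = "/home/cha0zz/Desktop/photos/img_{}.jpg".format(name)
    if os.path.isfile(path):
        return
    r = requests.get(img_url)
    with open(path, 'wb') as f:
        f.write(r.content)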