Unexpected behavior when scraping images with Selenium

Time: 2018-02-20 01:03:31

Tags: python python-3.x selenium if-statement web-scraping

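Can someone help me understand why my function does not return a result for every URL in the list I pass as an argument, and why I get the output below? I am just trying to return, for each item, its URL together with a list of all the corresponding images for that item.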

beta_test_items = ['https://www.facebook.com/marketplace/item/2009940172578816',
 'https://www.facebook.com/marketplace/item/1591865710899243']

from selenium import webdriver
from time import sleep
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def scrape_item_details(beta_test_items):
    #finish this function
    for url in beta_test_items:
        images = []
        driver.get(url)
        sleep(3)
        image_element = driver.find_element_by_xpath('//img[contains(@class, "_5m")]')
        images = [image_element.get_attribute('src')]

        try:
            previous_and_next_buttons = driver.find_elements_by_xpath("//i[contains(@class, '_3ffr')]")
            next_image_button = previous_and_next_buttons[1]
            print(next_image_button.text)
            if  next_image_button.is_displayed():
                next_image_button.click()

                image_element = driver.find_element_by_xpath('//img[contains(@class, "_5m")]')
                print(image_element.get_attribute('src'))
                sleep(2)   

                if image_element.get_attribute('src') in images:
                    pass
                else:
                    images.append(image_element.get_attribute('src'))

            else:
                pass
        except:
            pass

        yield(url, images)

if __name__ == '__main__':

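When I run it, I get the following output, and I don't know why it stops at the first URL after the second photo has been appended to the images list: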

In [46]: scrape_item_details(beta_items_list)
['https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27750896_2002108023449096_2229019388723795634_n.jpg?oh=26d3fe06595affdcbd142754766fe934&oe=5B0933C9']
Next
https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27655331_2002108026782429_4575620607831413757_n.jpg?oh=a7c94bc2b8ef8b39bc65291b641f7953&oe=5B0A11DD
Out[46]: 
('https://www.facebook.com/marketplace/item/2009940172578816',
 ['https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27750896_2002108023449096_2229019388723795634_n.jpg?oh=26d3fe06595affdcbd142754766fe934&oe=5B0933C9',
  'https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27655331_2002108026782429_4575620607831413757_n.jpg?oh=a7c94bc2b8ef8b39bc65291b641f7953&oe=5B0A11DD'])
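A single tuple like the one above is what a `return` placed inside a `for` loop would produce, because `return` ends the function during the first iteration. The sketch below is only an illustration of that difference and uses no Selenium; `first_only`, `every_item`, and the `example.com` URLs are hypothetical names made up for the demonstration.

```python
def first_only(urls):
    """Uses return inside the loop: the function exits on the first iteration."""
    for url in urls:
        return (url, [])


def every_item(urls):
    """Uses yield inside the loop: one tuple is produced per URL."""
    for url in urls:
        yield (url, [])


# Hypothetical placeholder URLs, not the listing URLs from the question.
sample_urls = ["https://example.com/item/1", "https://example.com/item/2"]

returned = first_only(sample_urls)      # a single tuple, for the first URL only
yielded = list(every_item(sample_urls)) # one tuple per URL
print(returned)
print(yielded)
```

Note that calling a generator function does nothing until it is iterated, which is why it has to be wrapped in `list(...)` or consumed in a loop.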

---- Update ---- I changed the return to yield, and when I run list(scrape_item_details(beta_test_items)) I get the following output:

[('https://www.facebook.com/marketplace/item/2009940172578816',
  ['https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27750896_2002108023449096_2229019388723795634_n.jpg?oh=26d3fe06595affdcbd142754766fe934&oe=5B0933C9',
   'https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27655331_2002108026782429_4575620607831413757_n.jpg?oh=a7c94bc2b8ef8b39bc65291b641f7953&oe=5B0A11DD',
   'https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27973017_1685674758138175_781683034741350935_n.jpg?oh=e2aa32aa73f3bb9061e861bd1ea306cb&oe=5B0741FF']),
 ('https://www.facebook.com/marketplace/item/1591865710899243',
  ['https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27750896_2002108023449096_2229019388723795634_n.jpg?oh=26d3fe06595affdcbd142754766fe934&oe=5B0933C9',
   'https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27655331_2002108026782429_4575620607831413757_n.jpg?oh=a7c94bc2b8ef8b39bc65291b641f7953&oe=5B0A11DD',
   'https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27973017_1685674758138175_781683034741350935_n.jpg?oh=e2aa32aa73f3bb9061e861bd1ea306cb&oe=5B0741FF'])]

I'm not sure why the first URL's images are repeated as the output for the second URL?
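One possible cause, which the post does not confirm, is that the `src` attribute is read before the newly requested page or image has finished rendering, so the previously displayed image is picked up again; a fixed `sleep` does not guarantee the content has changed. A minimal sketch of the alternative, polling until the value differs from what has already been seen, is shown below in plain Python. The helper `wait_for_new_value`, its `read_value` parameter, and `fake_page` are hypothetical names; in a Selenium script `read_value` would be a callable that looks up the image element and returns its `src`.

```python
import time


def wait_for_new_value(read_value, seen, timeout=5.0, poll=0.1):
    """Poll read_value() until it returns something not in `seen`.

    Returns the new value, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value is not None and value not in seen:
            return value
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Stand-in for a page that keeps showing the old image for two polls.
fake_page = iter(["a.jpg", "a.jpg", "b.jpg"])
result = wait_for_new_value(lambda: next(fake_page), seen={"a.jpg"}, poll=0.01)
print(result)
```

Selenium's own `WebDriverWait(driver, timeout).until(...)` follows the same polling idea, except that it raises `TimeoutException` instead of returning `None`. Replacing the bare `except: pass` with a handler for specific exceptions would also help, since it currently hides any error raised while clicking or locating elements.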

0 answers:

No answers