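How can I scrape information from multiple pages with Selenium WebDriver?

Asked: 2019-06-18 19:31:11

Tags: python selenium selenium-webdriver selenium-chromedriver webdriverwait

I am currently trying to get the titles of all lots (pages 1 to 33) in the "Hong Kong Watches 2.0" auction on the Bonhams website (https://www.bonhams.com/auctions/25281/?category=results#/!). I am new to Python and Selenium, but I tried the code below. It gives me the results I want, but only for page 1; after that it keeps repeating the page 1 results. The loop does not seem to click through to the next page. Can anyone help me fix this?

Here is the code I used: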

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

driver=webdriver.Chrome()
driver.get('https://www.bonhams.com/auctions/25281/?category=results#/!')

while True:
    next_page_btn =driver.find_elements_by_xpath("//*[@id='lots']/div[2]/div[5]/div/a[10]/div")
    if len(next_page_btn) <1:
        print("no more pages left")
        break
    else:
        titles = driver.find_elements_by_xpath("//*[@class='firstLine']")
        titles = [title.text for title in titles]
        print(titles)

    element = WebDriverWait(driver,5).until(expected_conditions.element_to_be_clickable((By.ID,'lots')))
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    element.click()

Below is the output I get. Python keeps repeating/loading this output (I think it does so 33 times?).

['Hong Kong Watches 2.0', '', 'OMEGA. A Very Fine And Rare Limited Edition 
Yellow Gold Chronograph Bracelet Watch, Commemorating the Apollo 11 Space 
Mission And The Successful Moon Landing in 1969', '', '', '', 'ROLEX. TWO 
SETS OF SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1970s', '', 'ROLEX. 
TWO SETS OF RARE SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1980s', 
'', 'PATEK PHILIPPE. A SET OF THREE RARE LIMOGES PORCELAIN AND ENAMEL 
DISHES', '', 'Bvlgari/MAUBOUSSIN. TWO SETS OF CUFFLINKS', '', 
'BOUCHERON/MONTBLANC. TWO SETS OF CUFFLINKS', '', 'PATEK PHILIPPE. TWO 
SETS OF CUFFLINKS', '', 'Jaeger-LeCoultre. A Gilt Brass Table Clock With 
8-Days Power Reserve and Alarm', '', 'Cartier & LeCoultre. A group of 
three gilt brass table clocks (Alarm/Alarm Worldtime/Engraved dial)', '', 
'Jaeger-LeCoultre. A Gilt Brass Table Clock With 8-Days Power Reserve', 
'', 'Reuge. A Gold Plated Musical Automaton Open Face Pocket Watch with 
Alarm', '', 'Imhof. An Attractive Gilt Brass Table Clock With Polychrome 
Enamel Dial', '', 'Vacheron Constantin. A Large Polished Metal Perpetual 
Calendar Wall Clock']
['Hong Kong Watches 2.0', '', 'OMEGA. A Very Fine And Rare Limited Edition 
Yellow Gold Chronograph Bracelet Watch, Commemorating the Apollo 11 Space 
Mission And The Successful Moon Landing in 1969', '', '', '', 'ROLEX. TWO 
SETS OF SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1970s', '', 'ROLEX. 
TWO SETS OF RARE SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1980s', 
'', 'PATEK PHILIPPE. A SET OF THREE RARE LIMOGES PORCELAIN AND ENAMEL 
DISHES', '', 'Bvlgari/MAUBOUSSIN. TWO SETS OF CUFFLINKS', '', 
'BOUCHERON/MONTBLANC. TWO SETS OF CUFFLINKS', '', 'PATEK PHILIPPE. TWO 
SETS OF CUFFLINKS', '', 'Jaeger-LeCoultre. A Gilt Brass Table Clock With 
8-Days Power Reserve and Alarm', '', 'Cartier & LeCoultre. A group of 
three gilt brass table clocks (Alarm/Alarm Worldtime/Engraved dial)', '', 
'Jaeger-LeCoultre. A Gilt Brass Table Clock With 8-Days Power Reserve', 
'', 'Reuge. A Gold Plated Musical Automaton Open Face Pocket Watch with 
Alarm', '', 'Imhof. An Attractive Gilt Brass Table Clock With Polychrome 
Enamel Dial', '', 'Vacheron Constantin. A Large Polished Metal Perpetual 
Calendar Wall Clock']

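1 Answer:

Answer 0 (score: 0)

You don't need the selenium library to scrape this data. You can fetch the data for all pages with the requests and BeautifulSoup libraries instead.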

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    "Accept": "application/json",
}

page_num = 1
title_list = []

while True:
    # JSON endpoint that the results page itself calls (visible in the browser's Network tab)
    url = 'https://www.bonhams.com/api/v1/lots/25281/?category=results&length=12&minimal=false&page={}'.format(page_num)
    print("===url===", url)
    response = requests.get(url, headers=headers).json()
    max_lot = response['max_lot']
    last_iSaleLotNo = 0
    titles = []
    for lot in response['lots']:
        last_iSaleLotNo = lot['lot_id_combined']
        # styled_title is an HTML fragment; the title text sits in the 'firstLine' div
        title = BeautifulSoup(lot['styled_title'], 'lxml').find("div", {'class': 'firstLine'}).text.strip()
        titles.append(title)

    title_list.append(titles)
    print("===titles===", titles)
    # Stop when the page is empty or its last lot is the final lot of the sale
    if not response['lots'] or int(max_lot) == int(last_iSaleLotNo):
        break

    page_num += 1

print(title_list)

Output for the first page:

['ROLEX. TWO SETS OF SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1970s', 'ROLEX. TWO SETS OF RARE SHOWCASE DISPLAYS, MADE FOR ROLEX RETAILERS IN 1980s', 'PATEK PHILIPPE. A SET OF THREE RARE LIMOGES PORCELAIN AND ENAMEL DISHES', 'Bvlgari/MAUBOUSSIN. TWO SETS OF CUFFLINKS', 'BOUCHERON/MONTBLANC. TWO SETS OF CUFFLINKS', 'PATEK PHILIPPE. TWO SETS OF CUFFLINKS', 'Jaeger-LeCoultre. A Gilt Brass Table Clock With 8-Days Power Reserve and Alarm', 'Cartier & LeCoultre. A group of three gilt brass table clocks (Alarm/Alarm Worldtime/Engraved dial)', 'Jaeger-LeCoultre. A Gilt Brass Table Clock With 8-Days Power Reserve', 'Reuge. A Gold Plated Musical Automaton Open Face Pocket Watch with Alarm', 'Imhof. An Attractive Gilt Brass Table Clock With Polychrome Enamel Dial', 'Vacheron Constantin. A Large Polished Metal Perpetual Calendar Wall Clock']
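Note that `title_list` ends up as one list per page. If a single flat list is more convenient, the paging logic can be pulled into a generator. This is only a sketch: `iter_lots`, `fetch_page`, `fake_fetch`, `FAKE_PAGES` and the `title` key are hypothetical names made up for illustration, and the sample pages are invented so the code runs offline. Only `lots`, `lot_id_combined` and `max_lot` are taken from the response fields used above.

```python
from typing import Callable, Dict, Iterator, List


def iter_lots(fetch_page: Callable[[int], Dict], max_pages: int = 100) -> Iterator[Dict]:
    """Yield every lot, requesting pages 1, 2, 3, ... until the last lot is seen."""
    for page_num in range(1, max_pages + 1):
        payload = fetch_page(page_num)
        lots = payload.get("lots", [])
        if not lots:
            # An empty page means there is nothing left to read.
            return
        yield from lots
        # Same stop condition as in the answer: the last lot on this page
        # is the final lot of the sale.
        if int(lots[-1]["lot_id_combined"]) == int(payload["max_lot"]):
            return


# Invented sample data standing in for the API (not real Bonhams output).
FAKE_PAGES = {
    1: {"max_lot": 3, "lots": [{"lot_id_combined": 1, "title": "A"},
                               {"lot_id_combined": 2, "title": "B"}]},
    2: {"max_lot": 3, "lots": [{"lot_id_combined": 3, "title": "C"}]},
}


def fake_fetch(page_num: int) -> Dict:
    """Hypothetical stand-in for a function that downloads one page of JSON."""
    return FAKE_PAGES.get(page_num, {"max_lot": 3, "lots": []})


all_titles: List[str] = [lot["title"] for lot in iter_lots(fake_fetch)]
print(all_titles)  # ['A', 'B', 'C']
```

Against the real site, `fetch_page` would wrap the `requests.get(url, headers=headers).json()` call from the answer, and the title would still have to be extracted from `styled_title` with BeautifulSoup. The `max_pages` cap is there so that an unexpected response cannot keep the loop running forever.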

Open the Network tab in your browser's developer tools and click the next-page button; you will see the request that returns the JSON response data.
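Once you have the parsed JSON in Python, a quick look at its top-level keys shows which fields are worth reading. The helper below is a rough sketch, not part of any library: `describe_payload` is a hypothetical name, and the sample string is invented and only mimics the two fields used in the code above.

```python
import json
from typing import Any, Dict


def describe_payload(payload: Dict[str, Any]) -> Dict[str, str]:
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in payload.items():
        if isinstance(value, list):
            summary[key] = "list of {} item(s)".format(len(value))
        else:
            summary[key] = type(value).__name__
    return summary


# Invented sample shaped like the fields the answer reads; a real payload
# would come from requests.get(url, headers=headers).json().
sample = json.loads('{"max_lot": 3, "lots": [{"lot_id_combined": 1}, {"lot_id_combined": 2}]}')
summary = describe_payload(sample)
print(summary)  # {'max_lot': 'int', 'lots': 'list of 2 item(s)'}
```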