Web scraping multiple pages with BeautifulSoup using the click() method

Time: 2018-11-23 20:27:14

Tags: python selenium-webdriver web-scraping

I want to scrape data from IMDb. To do this across multiple pages, I used the click() method from the selenium package.

Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

pages = [str(i) for i in range(10)]

#getting url for each page and year:
url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'
driver = webdriver.Chrome(r"C:\Users\yefida\Desktop\Study_folder\Online_Courses\The Complete Python Course\Project 2 - Quotes Webscraping\chromedriver.exe")
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

for page in pages:
    data = soup.find_all('div', class_ = 'lister-item mode-advanced')
    data_list = []
    for item in data:
        temp = {}
        # Name of movie
        temp['movie'] = item.h3.a.text
        # Year
        temp['year'] = item.find('span',{'class':'lister-item-year text-muted unbold'}).text.replace('(','').replace(')','').replace('I','').replace('–','')
        # Runtime in minutes
        temp['time'] = item.find('span',{'class':'runtime'}).text.replace(' min','')
        # Genre
        temp['genre'] = item.find('span',{'class':'genre'}).text.replace(' ','').replace('\n','')
        # Rating of users
        temp['raiting'] = item.find('div',{'class':'inline-block ratings-imdb-rating'}).text.replace('\n','').replace(',','.')
        # Metascore
        try:
            temp['metascore'] = item.find('div',{'class':'inline-block ratings-metascore'}).text.replace('\n','').replace('Metascore','').replace(' ','')
        except:
            temp['metascore'] = None
        data_list.append(temp)

    #next page
    continue_link = driver.find_element_by_link_text('Next')
    continue_link.click()

In the end I get an error:

'Message: no such element: Unable to locate element: {"method":"link text","selector":"Next"}
  (Session info: chrome=70.0.3538.102)
'

Can you help me fix it?

3 Answers:

Answer 0 (score: 1):

That's because the link text is actually "Next »", so try

continue_link = driver.find_element_by_link_text('Next »')

or match on a partial link text instead:

continue_link = driver.find_element_by_partial_link_text('Next')
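
A minimal sketch of how the corrected lookup could fit into the question's paging loop, assuming you also re-parse the page source after every click (NoSuchElementException is the selenium exception raised when the link is missing, e.g. on the last page):

from selenium.common.exceptions import NoSuchElementException

for page in pages:
    # parse whatever page the driver is currently on
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ... extract the movie data from soup as in the question ...
    try:
        driver.find_element_by_partial_link_text('Next').click()
    except NoSuchElementException:
        break  # no "Next" link on the last page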

Answer 1 (score: 1):

You can also use a CSS selector that targets the class of the next-page button:

driver.find_element_by_css_selector('.lister-page-next.next-page').click()

This class is consistent across pages. You can add a wait for the element to become clickable:

WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.lister-page-next.next-page')))
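
For reference, the wait above relies on these standard selenium imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC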

My understanding is that CSS selectors are meant to be a fast way to match elements. Some benchmarks here.
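
Putting the pieces together, one possible shape for the paging loop (just a sketch, reusing the driver and BeautifulSoup parsing from the question; TimeoutException is what the wait raises when no clickable next-page button is found):

from selenium.common.exceptions import TimeoutException

while True:
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ... scrape the current page from soup here ...
    try:
        next_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, '.lister-page-next.next-page')))
        next_button.click()
    except TimeoutException:
        break  # last page: the next-page button is no longer there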

Answer 2 (score: 1):

Following the logic below, you can refresh the soup object with each new page's content. I used the xpath '//a[contains(.,"Next")]' to click the next-page button. The script keeps clicking that button until there is no more button to click, and then breaks out of the loop. Give it a try:

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'

driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source,"lxml")

while True:
    # collect the title links on the current results page
    items = [itm.get_text(strip=True) for itm in soup.select('.lister-item-content a[href^="/title/"]')]
    print(items)

    try:
        # click the "Next" button and re-parse the new page source
        driver.find_element_by_xpath('//a[contains(.,"Next")]').click()
        soup = BeautifulSoup(driver.page_source, "lxml")
    except Exception:
        break  # no "Next" button left, so this was the last page
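
As a side note, if you keep the question's approach of building one dict per movie and appending it to data_list (created once, before the paging loop, so it is not reset on every page), the collected rows can be turned into a DataFrame at the end, for example:

import pandas as pd

df = pd.DataFrame(data_list)  # one row per movie: movie, year, time, genre, raiting, metascore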