I want to scrape data from IMDb. To do this across multiple pages, I used the click()
method from the selenium package.
Here is my code:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
pages = [str(i) for i in range(10)]
#getting url for each page and year:
url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'
driver = webdriver.Chrome(r"C:\Users\yefida\Desktop\Study_folder\Online_Courses\The Complete Python Course\Project 2 - Quotes Webscraping\chromedriver.exe")
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
for page in pages:
    data = soup.find_all('div', class_='lister-item mode-advanced')
    data_list = []
    for item in data:
        temp = {}
        # Name of movie
        temp['movie'] = item.h3.a.text
        # Year
        temp['year'] = item.find('span', {'class': 'lister-item-year text-muted unbold'}).text.replace('(', '').replace(')', '').replace('I', '').replace('–', '')
        # Runtime in minutes
        temp['time'] = item.find('span', {'class': 'runtime'}).text.replace(' min', '')
        # Genre
        temp['genre'] = item.find('span', {'class': 'genre'}).text.replace(' ', '').replace('\n', '')
        # Rating of users
        temp['raiting'] = item.find('div', {'class': 'inline-block ratings-imdb-rating'}).text.replace('\n', '').replace(',', '.')
        # Metascore
        try:
            temp['metascore'] = item.find('div', {'class': 'inline-block ratings-metascore'}).text.replace('\n', '').replace('Metascore', '').replace(' ', '')
        except:
            temp['metascore'] = None
        data_list.append(temp)
    # next page
    continue_link = driver.find_element_by_link_text('Next')
    continue_link.click()
In the end I get this error:
'Message: no such element: Unable to locate element: {"method":"link text","selector":"Next"}
(Session info: chrome=70.0.3538.102)
'
Can you help me fix it?
Answer 0 (score: 1)
That's because the link text is actually "Next »", so try:
continue_link = driver.find_element_by_link_text('Next »')
or
continue_link = driver.find_element_by_partial_link_text('Next')
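For illustration, a minimal sketch of how that locator could be plugged into the question's paging loop; scrape_page is a hypothetical placeholder for the per-page scraping code from the question, and the page source is re-parsed after every click:

# Minimal sketch, not the full script: 'scrape_page' is a hypothetical
# stand-in for the scraping code from the question.
for page in pages:
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    scrape_page(soup)
    continue_link = driver.find_element_by_partial_link_text('Next')
    continue_link.click()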
Answer 1 (score: 1)
You can also use a CSS selector that targets the class of the next-page button:
driver.find_element_by_css_selector('.lister-page-next.next-page').click()
This class is consistent across pages. You can add a wait for the element to be clickable:
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.lister-page-next.next-page')))
My understanding is that CSS selectors should be a fast way of matching. Some benchmarks here.
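For reference, the wait-based line above needs a few imports; a minimal, self-contained sketch assuming the same Selenium 3-style API used elsewhere in this thread:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1')

# Wait until the next-page button is clickable, then click it.
next_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '.lister-page-next.next-page'))
)
next_button.click()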
Answer 2 (score: 1)
Following the logic below, you can update the soup object with the new page content. I used the xpath '//a[contains(.,"Next")]'
to click the next-page button. The script should keep clicking the next-page button until there are no more buttons to click, and then break out of the loop. Give it a try:
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source,"lxml")
while True:
    items = [itm.get_text(strip=True) for itm in soup.select('.lister-item-content a[href^="/title/"]')]
    print(items)
    try:
        driver.find_element_by_xpath('//a[contains(.,"Next")]').click()
        soup = BeautifulSoup(driver.page_source, "lxml")
    except Exception:
        break
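Since the question imports pandas, one option is to collect the per-movie dicts from every page into a single list and convert it at the end; a minimal sketch, assuming data_list is initialized once before the page loop and appended to the way the question's inner loop does it:

import pandas as pd

# data_list is assumed to hold the per-movie dicts built in the question's
# loop (movie, year, time, genre, raiting, metascore).
df = pd.DataFrame(data_list)
df.to_csv('imdb_2018.csv', index=False)  # hypothetical output file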