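I'm trying to build a simple scraping loop that picks up titles from dynamic pages. I wrote a small script that works the way I expect. Here is the working script: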
from selenium import webdriver
driver = webdriver.Chrome('C:/Users/user/Downloads/chromedriver_win32/chromedriver.exe')
url = "https://www.youtube.com/user/LinusTechTips/videos"
driver.get(url)
videos = driver.find_elements_by_xpath('.//*[@id="dismissable"]')
for video in videos:
    title = video.find_element_by_xpath('.//*[@id="video-title"]').text
    print(title)
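It correctly walks through the divs that contain the title and other details, and scrapes each title. However, the script only seems to work on YouTube. I have tried it on craigslist, amazon, bookstoscrape, rightmove and hostelworld, and it doesn't seem to work on any of those pages. Here is the script for hostelworld: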
from selenium import webdriver
driver = webdriver.Chrome('C:/Users/user/Downloads/chromedriver_win32/chromedriver.exe')
url = "https://www.hostelworld.com/s?q=New%20York,%20New%20York,%20USA&country=USA&city=New%20York&type=city&id=13&from=2020-08-14&to=2020-08-16&guests=2&page=1"
driver.get(url)
cards = driver.find_elements_by_xpath('.//*[@id="__layout"]/div/div[1]/div[4]/div/div/div[3]')
for card in cards:
    title = card.find_element_by_xpath('.//*[@id="__layout"]/div/div[1]/div[4]/div/div/div[3]/div[2]/div[1]/h2/a').text
    print(title)
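I'm fairly sure the card class name is correct, since I found it by searching in Chrome DevTools. I believe the title XPath is correct too, because it prints properly when I use it outside the loop. I also believe the loop is correct, because if I change the cards variable to: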
cards = driver.find_elements_by_class_name('property-card')
it prints the title once for every card on the page.
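However, when I add a leading "." to the title XPath, it returns an error message: "Message: no such element: Unable to locate element: ...". I use "." as a prefix on the expression so that it searches only inside the parent element being iterated over, not the whole page. But for some reason, adding "." raises an error on every site I have tried except YouTube.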
I'm trying to stick with XPaths as much as possible, because not every site has good class and id conventions.
Answer 0 (score: 1)
To get the titles of all properties, induce WebDriverWait() and wait for visibility_of_all_elements_located() with the following CSS selector.
url = "https://www.hostelworld.com/s?q=New%20York,%20New%20York,%20USA&country=USA&city=New%20York&type=city&id=13&from=2020-08-14&to=2020-08-16&guests=2&page=1"
driver.get(url)
cards = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.property-card h2.title.title-6>a")))
for card in cards:
    title = card.text
    print(title)
Output:
The Local NYC
HI NYC Hostel
NY Moore Hostel
Broadway Hotel n Hostel
Q4 Hotel
American Dream Hostel
Giorgio Hotel
Freehand New York
West Side YMCA
Hotel 31
Vanderbilt YMCA
Union Hotel Brooklyn
Victorian Inn
Central Park West Hostel
Jazz on the Park Youth Hotel
The Jane
Nesva Hotel
John Hotel
Note that you need the following imports.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
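For intuition about what the explicit wait above does on a dynamic page: it keeps re-checking a condition until the condition returns something truthy or the timeout runs out. The sketch below illustrates that polling idea in plain Python. It is not Selenium's implementation: `wait_until` and `find_cards` are hypothetical names, the "page" is simulated, and it raises the built-in TimeoutError where Selenium raises its own TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Hypothetical helper: call `condition` repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Simulated dynamic page: the cards only "render" on the third check.
attempts = {"count": 0}


def find_cards():
    attempts["count"] += 1
    if attempts["count"] >= 3:
        return ["The Local NYC", "HI NYC Hostel"]
    return []  # nothing rendered yet


cards = wait_until(find_cards, timeout=2.0, poll_interval=0.01)
print(cards)
```

Without the wait, the lookup would run once, right after driver.get(url) returns, and could come back empty if the cards have not been rendered yet.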
Updated to include prices.
url = "https://www.hostelworld.com/s?q=New%20York,%20New%20York,%20USA&country=USA&city=New%20York&type=city&id=13&from=2020-08-14&to=2020-08-16&guests=2&page=1"
driver.get(url)
cards = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.property-card")))
for card in cards:
    try:
        # Both lookups are scoped to the current card
        title = card.find_element_by_css_selector("h2.title.title-6>a").text
        print(title)
        price = card.find_element_by_css_selector("p.price.title-5").text
        print(price)
    except Exception:
        # Skip cards where the title or price lookup fails
        continue
Output:
The Local NYC
€45
HI NYC Hostel
€41
NY Moore Hostel
€158
Broadway Hotel n Hostel
€73
Freehand New York
€95
Q4 Hotel
€37
Giorgio Hotel
€158
American Dream Hostel
€128
West Side YMCA
€87
Vanderbilt YMCA
€89
Hotel 31
€74
Union Hotel Brooklyn
€128
Victorian Inn
€88
Central Park West Hostel
€42
The Jane
€115
Jazz on the Park Youth Hotel
€78
Nesva Hotel
€136
John Hotel
€165
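As for why the "."-prefixed XPath in the question raises "no such element": a leading "." limits the search to the descendants of the current element. The element with id "__layout" is an ancestor of each card, not a descendant, so a path that starts from it cannot match from inside a card. On YouTube the "video-title" element sits inside each "dismissable" element, which is why that script works. The sketch below reproduces the scoping rule without a browser, using the standard library's xml.etree.ElementTree, which supports only a subset of XPath. The markup is made up for illustration and is not hostelworld's real DOM.

```python
import xml.etree.ElementTree as ET

# Made-up markup for illustration only; not hostelworld's real DOM.
PAGE = """
<html>
  <body>
    <div id="__layout">
      <div class="property-card"><h2><a>Hostel A</a></h2></div>
      <div class="property-card"><h2><a>Hostel B</a></h2></div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)
cards = root.findall(".//div[@class='property-card']")

# Searched from the document root, the path through the ancestor id matches.
from_root = root.find(".//*[@id='__layout']/div/h2/a").text

# Searched from inside a card, the same path matches nothing:
# "__layout" is above the card, not below it.
ancestor_hits = [card.find(".//*[@id='__layout']/div/h2/a") for card in cards]

# A path written relative to the card finds one title per card.
titles = [card.find(".//h2/a").text for card in cards]

print(from_root)
print(ancestor_hits)
print(titles)
```

Applied to the question's loop, this suggests writing the inner expression relative to the card, for example card.find_element_by_xpath('.//h2/a'), instead of repeating the full path from "__layout".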