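How to extract product titles from a website using Selenium Python

Time: 2019-04-11 08:52:19

Tags: selenium xpath web-scraping css-selectors webdriverwait

I'm trying to scrape titles from a website, but it only returns one title. How can I get all the titles?

Here is one of the elements I'm trying to get with XPath (using starts-with):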

<div id="post-4550574" class="post-box    " data-permalink="https://hypebeast.com/2019/4/undercover-nike-sfb-mountain-sneaker-release-info" data-title="The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date"><div class="post-box-image-container fixed-ratio-3-2">

Here is my current code:

from selenium import webdriver
import requests
from bs4 import BeautifulSoup as bs

driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
driver.get('https://hypebeast.com/search?s=nike+undercover')

element = driver.find_element_by_xpath(".//*[starts-with(@id, 'post-')]")
print(element.get_attribute('data-title'))

Output: The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date

I expected more titles, but only one result is returned.

4 answers:

Answer 0 (score: 1)

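To extract the product titles from the website, note that the desired elements are rendered by JavaScript, so you need to use WebDriverWait with visibility_of_all_elements_located(). You can then use either of the following locator strategies. Both snippets need these imports:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

  • XPATH:

    driver.get('https://hypebeast.com/search?s=nike+undercover')
    print([element.text for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2/span")))])

  • CSS_SELECTOR:

    driver.get('https://hypebeast.com/search?s=nike+undercover')
    print([element.text for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h2>span")))])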
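
As a rough illustration of what an explicit wait does, the sketch below polls a condition until it returns something truthy or a timeout expires. This is not Selenium's own implementation: `poll_until` and `SlowPage` are hypothetical names, and `SlowPage` only stands in for a page whose JavaScript-rendered titles appear some time after loading.

```python
import time


def poll_until(condition, timeout=30.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical stand-in for a page whose titles only become visible
# after a few polls (assumption: no real browser is involved here).
class SlowPage:
    def __init__(self, ready_after, titles):
        self.calls = 0
        self.ready_after = ready_after
        self.titles = titles

    def visible_titles(self):
        self.calls += 1
        return self.titles if self.calls >= self.ready_after else []


page = SlowPage(ready_after=3, titles=["Title A", "Title B"])
titles = poll_until(page.visible_titles, timeout=5.0, interval=0.01)
print(titles)
```

With a real driver, WebDriverWait(driver, 30).until(...) plays the role of `poll_until`, and the expected condition plays the role of `page.visible_titles`.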
    

Answer 1 (score: 1)

You don't need Selenium. You can use requests, which is faster, and target the data-title attribute:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://hypebeast.com/search?s=nike+undercover')
soup = bs(r.content, 'lxml')
titles = [item['data-title'] for item in soup.select('[data-title]')]
print(titles)
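
If BeautifulSoup or lxml is not installed, the same attribute extraction can be sketched with the standard library's html.parser. This is only an offline illustration: `DataTitleCollector` is a hypothetical name, the first div is modelled on the element in the question, and the second one is made up. It also assumes the data-title attributes are present in the HTML the server returns.

```python
from html.parser import HTMLParser


class DataTitleCollector(HTMLParser):
    """Collect the value of every data-title attribute in document order."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag.
        for name, value in attrs:
            if name == "data-title" and value is not None:
                self.titles.append(value)


# Sample markup (assumption: a trimmed stand-in for the real page source).
html = """
<div id="post-4550574" class="post-box" data-title="The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date"></div>
<div id="post-1" class="post-box" data-title="Hypothetical second post"></div>
<div class="no-title"></div>
"""

collector = DataTitleCollector()
collector.feed(html)
print(collector.titles)
```

In practice the string passed to `feed()` would be the response text from requests rather than a literal.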

If you do want Selenium, the equivalent syntax is:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://hypebeast.com/search?s=nike+undercover')
titles = [item.get_attribute('data-title') for item in driver.find_elements_by_css_selector('[data-title]')]
print(titles)

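Answer 2 (score: 0)

If the locator matches multiple elements, find_element returns only the first one. find_elements returns a list of all elements matched by the locator.
You can then iterate over the list and read each element.

If all the elements you are looking for have the class post-box, you can locate them by class name.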
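
The difference between the two calls can be sketched as follows. `first_and_all_titles`, `FakeElement`, and `FakeDriver` are hypothetical names; the fakes exist only so the sketch runs without a browser, and the titles are made up. The locator is passed in the `find_elements(by, value)` form, using the string "css selector", which is the value of Selenium's By.CSS_SELECTOR.

```python
CSS_SELECTOR = "css selector"  # same string as selenium's By.CSS_SELECTOR


def first_and_all_titles(driver, selector="[data-title]"):
    """Return (first title, list of all titles) for elements matching selector."""
    first = driver.find_element(CSS_SELECTOR, selector)    # one element
    every = driver.find_elements(CSS_SELECTOR, selector)   # list of elements
    return (
        first.get_attribute("data-title"),
        [element.get_attribute("data-title") for element in every],
    )


# Hypothetical stand-ins for a WebDriver and its elements (assumption:
# they mimic only the three methods used above).
class FakeElement:
    def __init__(self, attributes):
        self._attributes = attributes

    def get_attribute(self, name):
        return self._attributes.get(name)


class FakeDriver:
    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, value):
        return list(self._elements)

    def find_element(self, by, value):
        return self._elements[0]


driver = FakeDriver([
    FakeElement({"data-title": "First post"}),
    FakeElement({"data-title": "Second post"}),
])
first, every = first_and_all_titles(driver)
print(first)
print(every)
```

With a real browser, the same function could be called with a webdriver.Chrome() instance after driver.get(...). Recent Selenium 4 releases no longer provide the find_element_by_* and find_elements_by_* helpers, so the `find_elements(by, value)` form is the one likely to keep working.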

Answer 3 (score: 0)

Just sharing my experience and what I've used, in case it helps someone. Simply use:

element.get_attribute('ATTRIBUTE-NAME')