I want to scrape the links and titles of all the questions on the page https://www.reddit.com/search?q=Expiration&type=link&sort=new. The elements have the following structure:
<a data-click-id="body" class="SQnoC3ObvgnGjWt90zD9Z" href="/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/">
<h2 class="s1okktje-0 cDxKta">
<span style="font-weight:normal">Calculating Expiration Dates - Previous Solution No Longer Works</span>
</h2>
</a>
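I use questions = driver.find_elements_by_xpath('//a[@data-click-id="body"]') to get the questions and then iterate over them with a for loop. Getting the link works fine with question.get_attribute('href').
However, I don't know how to extract the title from the span inside each question.
Does anyone know how to do this?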
Answer 0 (score: 1)
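In Selenium,
question.find_element_by_xpath('./h2/span').text
returns the text of the span element nested under the current question inside the for loop. Note that it is find_element (singular): find_elements returns a list, which has no text attribute.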
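On recent Selenium 4 releases, where the find_element_by_* helpers have been removed, the same loop can be sketched as below. This is a minimal sketch rather than the answerer's code: collect_questions is a hypothetical helper name, the driver is assumed to have already loaded the search page, and the locator strategy is passed as the plain string "xpath", which is the value of By.XPATH.

```python
def collect_questions(driver):
    """Return a list of (title, link) tuples for the search results.

    `driver` is assumed to be a Selenium 4 WebDriver that has already
    loaded the search page; `collect_questions` is a hypothetical name.
    """
    results = []
    # "xpath" is the string value of selenium's By.XPATH constant.
    questions = driver.find_elements("xpath", '//a[@data-click-id="body"]')
    for question in questions:
        # href as resolved by the browser (an absolute URL).
        link = question.get_attribute("href")
        # Relative XPath: search only inside the current <a> element.
        title = question.find_element("xpath", "./h2/span").text
        results.append((title, link))
    return results
```

With a live session this would be called as collect_questions(driver) right after driver.get(...).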
Using lxml:
import requests
from lxml import html
# Send a browser-like User-agent header with the request.
UA = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0 Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'}
page = requests.get('https://www.reddit.com/search?q=Expiration&type=link&sort=new',
                    headers=UA)
tree = html.fromstring(page.content)
# Every search result is an <a data-click-id="body"> element.
questions = tree.xpath('//a[@data-click-id="body"]')
parsed_q = []
for question in questions:
    # XPath queries return lists, so take the first match of each.
    url = question.xpath('./@href')[0]
    title = question.xpath('./h2/span/text()')[0]
    print("Title: {} --- URL: {}".format(title, url))
    parsed_q.append((title, url))
print(parsed_q)
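One thing to keep in mind with the lxml version: ./@href yields the attribute exactly as it is written in the markup, which in the question's sample is a site-relative path (/r/excel/...), whereas Selenium's get_attribute('href') returns the resolved absolute URL. A minimal sketch of resolving such paths with the standard library follows. BASE_URL and to_absolute are illustrative names, and the sample tuple is copied from the question's markup rather than fetched from the site.

```python
from urllib.parse import urljoin

# The search page the hrefs were scraped from (taken from the question).
BASE_URL = "https://www.reddit.com/search?q=Expiration&type=link&sort=new"

def to_absolute(pairs, base_url=BASE_URL):
    """Resolve the href of each (title, href) pair against base_url."""
    return [(title, urljoin(base_url, href)) for title, href in pairs]

# Sample data in the same shape as parsed_q, copied from the question's markup.
sample = [(
    "Calculating Expiration Dates - Previous Solution No Longer Works",
    "/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/",
)]

for title, url in to_absolute(sample):
    print("Title: {} --- URL: {}".format(title, url))
```

Hrefs that are already absolute pass through urljoin unchanged, so the helper can be applied to the whole list without checking each entry first.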
Answer 1 (score: 1)
To scrape the title and href attribute of every question on the webpage, you need to induce WebDriverWait for visibility_of_all_elements_located().
Code block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--disable-extensions")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.reddit.com/search?q=Expiration&type=link&sort=new")
# Wait up to 30 seconds until all result anchors are visible.
elements = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]")))
# Titles come from the span nested under each anchor; links come from the anchor itself.
question_title = [element.get_attribute("innerHTML") for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]/h2/span")))]
question_link = [element.get_attribute("href") for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]")))]
# Pair each title with its link by position.
for i, j in zip(question_title, question_link):
    print("{} question link is {}".format(i, j))
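Because the titles above are read through get_attribute("innerHTML"), they come back as markup: a character such as & arrives as the entity &amp;. A small sketch of decoding such entities with the standard library is shown below, assuming the span holds only text as in the question's sample. clean_title is a hypothetical helper name and the sample string is made up for illustration.

```python
from html import unescape

def clean_title(inner_html):
    """Decode HTML entities and trim surrounding whitespace."""
    return unescape(inner_html).strip()

# Made-up innerHTML value, for illustration only.
print(clean_title(" Expiration dates &amp; reminders "))
```

Reading element.text instead of innerHTML avoids the decoding step, since Selenium then returns the rendered text.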
Answer 2 (score: 0)
Try the following.
question.find_element_by_tag_name('span').text
Or simply
question.text