Question

I want to scrape the links and titles of all the questions on the page https://www.reddit.com/search?q=Expiration&type=link&sort=new. The elements have the following structure:

<a data-click-id="body" class="SQnoC3ObvgnGjWt90zD9Z" href="/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/">
    <h2 class="s1okktje-0 cDxKta">
        <span style="font-weight:normal">Calculating Expiration Dates - Previous Solution No Longer Works</span>
    </h2>
</a>

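I use questions = driver.find_elements_by_xpath('//a[@data-click-id="body"]') to get the questions and then iterate over them with a for loop. Getting the link with question.get_attribute('href') works fine.

However, I don't know how to extract the title from the span inside each question.

Does anyone know how to do this?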

Answer 1

In Selenium

question.find_element_by_xpath('./h2/span').text

returns the text of the underlying span element for the current question inside the for loop.
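
The leading ./ in that XPath makes the lookup relative to the current element rather than the whole document. As a small browser-free illustration, the sketch below runs the same relative path against the markup from the question using Python's standard-library xml.etree.ElementTree. That parser is swapped in for Selenium purely for demonstration, and it works here only because this particular snippet happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question; it is well-formed XML, so the
# standard-library parser can read it without a browser.
SNIPPET = """
<a data-click-id="body" class="SQnoC3ObvgnGjWt90zD9Z" href="/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/">
    <h2 class="s1okktje-0 cDxKta">
        <span style="font-weight:normal">Calculating Expiration Dates - Previous Solution No Longer Works</span>
    </h2>
</a>
"""

# The root element plays the role of one `question` from the loop.
question = ET.fromstring(SNIPPET.strip())

# './h2/span' starts at the <a> element itself, not at the document root.
span = question.find("./h2/span")
title = span.text
link = question.get("href")

print(title)
print(link)
```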

Using lxml

import requests
from lxml import html

UA = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0 Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'}

page = requests.get('https://www.reddit.com/search?q=Expiration&type=link&sort=new',
                    headers = UA)

tree = html.fromstring(page.content)

questions = tree.xpath('//a[@data-click-id="body"]')

parsed_q = []

for question in questions:
    url = question.xpath('./@href')[0]
    title = question.xpath('./h2/span/text()')[0]
    print("Title: {} --- URL: {}".format(title, url))
    parsed_q.append((title, url))

print(parsed_q)
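
Note that the XPath ./@href yields the raw attribute value, which in the markup from the question is a site-relative path such as /r/excel/comments/.... If absolute URLs are wanted, one possible follow-up step is to resolve each path against the page URL with urllib.parse.urljoin. The helper name absolutize and the sample data below are illustrative assumptions, not part of the original answer.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.reddit.com/search?q=Expiration&type=link&sort=new"


def absolutize(pairs, base=BASE_URL):
    """Return (title, absolute_url) tuples from (title, href) tuples.

    Hypothetical helper: it only resolves each href against `base`.
    """
    return [(title, urljoin(base, href)) for title, href in pairs]


# Sample entry built from the markup shown in the question.
sample = [(
    "Calculating Expiration Dates - Previous Solution No Longer Works",
    "/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/",
)]

for title, url in absolutize(sample):
    print("Title: {} --- URL: {}".format(title, url))
```

In the answer's script this could be applied as absolutize(parsed_q) after the loop.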

Answer 2

To scrape the title and href attributes of all the questions on the page, you need to induce WebDriverWait for visibility_of_all_elements_located(). You can use the following solution:

Code block:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--disable-extensions")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.reddit.com/search?q=Expiration&type=link&sort=new")
elements = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]")))
question_title = [element.get_attribute("innerHTML") for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]/h2/span")))]
question_link = [element.get_attribute("href") for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]")))]
for i,j in zip(question_title, question_link):
    print("{} question link is {}".format(i, j))
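
The script above locates the titles and the links in two separate queries and pairs them with zip, which assumes both lists have the same length and order. A possible alternative, sketched here under that concern, reads both values from the same anchor element so a title cannot be matched with the wrong link. The function name collect_questions is hypothetical; find_element is called with the string "xpath", which is the value of By.XPATH, so the sketch itself needs no Selenium import.

```python
def collect_questions(anchors):
    """Build (title, link) pairs, reading both values from the same <a>.

    `anchors` is assumed to be the list of WebElements matched by
    "//a[@data-click-id='body' and @href]".
    """
    pairs = []
    for anchor in anchors:
        # "xpath" is the string value of By.XPATH; the leading "./"
        # keeps the lookup relative to this anchor.
        span = anchor.find_element("xpath", "./h2/span")
        pairs.append((span.text, anchor.get_attribute("href")))
    return pairs
```

With the driver from the script above, it could be called as collect_questions(elements) once the wait has returned.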

Answer 3

Try the following.

question.find_element_by_tag_name('span').text

Or simply

question.text

How to extract the title and href attribute from questions on the reddit.com search page using Selenium Python
