How to extract the title and href attribute of questions on the reddit.com search page using Selenium and Python

Time: 2019-03-07 23:23:36

Tags: python selenium selenium-webdriver webdriver webdriverwait

I want to scrape the links and titles of all the questions on the page https://www.reddit.com/search?q=Expiration&type=link&sort=new. The elements have the following structure:

<a data-click-id="body" class="SQnoC3ObvgnGjWt90zD9Z" href="/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/">
    <h2 class="s1okktje-0 cDxKta">
        <span style="font-weight:normal">Calculating Expiration Dates - Previous Solution No Longer Works</span>
    </h2>
</a>

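I use questions = driver.find_elements_by_xpath('//a[@data-click-id="body"]') to get the questions and then iterate over them with a for loop. Getting the link with question.get_attribute('href') works fine.

However, I don't know how to extract the title from the span inside each question.

Does anyone know how to do this?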

3 answers:

Answer 0: (score: 1)

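In Selenium:

question.find_element_by_xpath('./h2/span').text

Inside the for loop, this returns the text of the span element nested under the current question element. Note that it must be find_element (singular), since find_elements returns a list, which has no text attribute.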
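
Placed inside the loop from the question, that line could be used roughly as follows. This is only a minimal sketch: the helper name collect_questions is hypothetical, and it assumes the Selenium 3 style locator methods used in the question (find_elements_by_xpath). Those helpers were removed in later Selenium 4 releases, where find_elements(By.XPATH, ...) is the equivalent call.

```python
def collect_questions(driver):
    """Return (title, href) pairs for every result link on the current page.

    `driver` is assumed to be a Selenium 3 WebDriver that has already
    loaded the Reddit search page; the helper name is hypothetical.
    """
    results = []
    for question in driver.find_elements_by_xpath('//a[@data-click-id="body"]'):
        # Relative XPath: "./" starts the search from the current <a> element.
        title = question.find_element_by_xpath('./h2/span').text
        href = question.get_attribute('href')
        results.append((title, href))
    return results
```

Reading both values from the same a element keeps each title paired with its own link.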

Using lxml:

import requests
from lxml import html

UA = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0 Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'}

page = requests.get('https://www.reddit.com/search?q=Expiration&type=link&sort=new',
                    headers = UA)

tree = html.fromstring(page.content)

questions = tree.xpath('//a[@data-click-id="body"]')

parsed_q = []

for question in questions:
    url = question.xpath('./@href')[0]
    title = question.xpath('./h2/span/text()')[0]
    print("Title: {} --- URL: {}".format(title,url))
    parsed_q.append(tuple([title,url]))

print(parsed_q)
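
One difference from the Selenium version is worth noting: ./@href yields the attribute exactly as it is written in the HTML, which for the element shown in the question is a site-relative path (/r/excel/comments/...), whereas Selenium's get_attribute('href') returns the resolved absolute URL. A small sketch of resolving the path with the standard library follows; BASE_URL and absolute_url are names introduced here purely for illustration.

```python
from urllib.parse import urljoin

# The page the links were scraped from (taken from the question).
BASE_URL = "https://www.reddit.com/search?q=Expiration&type=link&sort=new"

def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)

print(absolute_url("/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/"))
```

In the loop above, this would mean storing absolute_url(url) instead of the raw url.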

Answer 1: (score: 1)

To scrape the title and href attribute of every question on the webpage, you need to induce WebDriverWait for visibility_of_all_elements_located(). You can use the following solution:

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument("--disable-extensions")
    options.add_argument('disable-infobars')
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://www.reddit.com/search?q=Expiration&type=link&sort=new")
    elements = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]")))
    question_title = [element.get_attribute("innerHTML") for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]/h2/span")))]
    question_link = [element.get_attribute("href") for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]")))]
    for i,j in zip(question_title, question_link):
        print("{} question link is {}".format(i, j))
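
One caveat when reading titles through innerHTML: the value is markup, so a title containing & or < comes back as &amp; or &lt;. A hedged sketch of decoding such strings with the standard library is shown below (the helper name clean_title is hypothetical); reading element.text instead avoids the issue altogether.

```python
from html import unescape

def clean_title(inner_html):
    """Decode HTML entities in a title read via get_attribute("innerHTML")."""
    return unescape(inner_html).strip()
```

With this helper, the list comprehension for the titles would wrap each value as clean_title(element.get_attribute("innerHTML")).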
    

Answer 2: (score: 0)

Try the following:

question.find_element_by_tag_name('span').text

Or simply:

question.text