Extract button link text from a website with Python Selenium

Date: 2019-05-24 06:13:28

Tags: python-3.x selenium-webdriver web-scraping beautifulsoup web-crawler

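Here is the link I want to extract button link text from, but I am unable to do so. After opening the site, I choose an option from "Select Product". Say I pick the first one, "Acrylic Coatings": three types then appear, namely "Primers", "Intermediates", and "Finishes". I want to extract that text, but I have not been able to.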

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome('~/chromedriver.exe')

driver.get('http://www.asianpaintsppg.com/applications/protective_products.aspx')
lst_name = ['Acrylic Coatings','Glass Flake Coatings']

for i in lst_name:
    print(i)
    driver.find_element_by_xpath("//select[@name='txtProduct']/option[text()="+"'"+str(i)+"'"+"]").click()
    page = requests.get("http://www.asianpaintsppg.com/applications/protective_products.aspx")
    soup = BeautifulSoup(page.content, 'html.parser')
    for div in soup.findAll('table', attrs={'id':'dataLstSubCat'}):
      print(div.find('a')['href'])

But I get empty output here. Any help would be appreciated.

3 answers:

Answer 0 (score: 2)

You can get the subcategories without using Selenium. Try a POST request like the one shown below.

import requests
from bs4 import BeautifulSoup

url = "http://www.asianpaintsppg.com/applications/protective_products.aspx"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['txtProduct'] = '2'  # value of the dropdown option to select
    res = s.post(url,data=payload)
    sauce = BeautifulSoup(res.text,"lxml")
    subcat = [item.text for item in sauce.select("[id^='dataLstSubCat_']")]
    print(subcat)

Output you should get:

['Primers', 'Intermediates', 'Finishes']
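
To make the payload step more concrete, here is a minimal offline sketch of what the dict comprehension above does: copy every named `<input>` (including hidden state fields) and then override the dropdown value. It uses only the standard library instead of BeautifulSoup, and the sample form, its field values, and the helper names `InputCollector` and `build_payload` are assumptions for illustration; `__VIEWSTATE` and `__EVENTVALIDATION` are fields typically found on ASP.NET WebForms pages, not confirmed for this one.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the page's form; the real fields and values will differ.
SAMPLE_FORM = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789">
  <input type="text" name="txtSearch">
  <select name="txtProduct"><option value="1">A</option></select>
</form>
"""


class InputCollector(HTMLParser):
    """Collect name -> value for every <input> that has a name attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if "name" in attributes:
            # A missing value attribute becomes an empty string, like i.get('value', '').
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_payload(html, product_value):
    """Copy all named inputs, then set the dropdown field to the chosen value."""
    collector = InputCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload["txtProduct"] = product_value
    return payload


payload = build_payload(SAMPLE_FORM, "2")
print(payload)
```

The `<select>` is not an `<input>`, so its field only reaches the payload through the explicit assignment, which is why the answer sets `payload['txtProduct']` by hand.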

Answer 1 (score: 1)

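You want .text rather than the href attribute, and you also need a wait condition so the page has time to update. Select the links with: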

#dataLstSubCat a

Then extract .text in a loop or comprehension:

items = [item.text for item in soup.select('#dataLstSubCat a')]
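
As a rough illustration of why the text is the useful part, the sketch below parses a made-up table offline with the standard library and collects both the href and the text of each link. The markup is hypothetical, modelled on how ASP.NET link buttons are commonly rendered (an href that calls `__doPostBack`); the element ids, the postback targets, and the `LinkCollector` class are assumptions, not taken from the real page.

```python
from html.parser import HTMLParser

# Hypothetical markup; the real page's ids and postback targets may differ.
SAMPLE = """
<table id="dataLstSubCat">
  <tr><td><a id="dataLstSubCat_LnkBtnSubCat_0"
             href="javascript:__doPostBack('dataLstSubCat$ctl00$LnkBtnSubCat','')">Primers</a></td></tr>
  <tr><td><a id="dataLstSubCat_LnkBtnSubCat_1"
             href="javascript:__doPostBack('dataLstSubCat$ctl01$LnkBtnSubCat','')">Intermediates</a></td></tr>
  <tr><td><a id="dataLstSubCat_LnkBtnSubCat_2"
             href="javascript:__doPostBack('dataLstSubCat$ctl02$LnkBtnSubCat','')">Finishes</a></td></tr>
</table>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False
        self._href = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._in_link:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._parts).strip()))
            self._in_link = False


collector = LinkCollector()
collector.feed(SAMPLE)
for href, text in collector.links:
    print(text, "<-", href)
```

In markup like this, the href is only a JavaScript call, so printing it gives nothing readable, while the text carries the label.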

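You can do everything with Selenium: you need one wait condition to ensure the content is present, and an additional one for the text to change after the first iteration. I used time.sleep, which is suboptimal.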

items = [item.text for item in  WebDriverWait(driver,5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#dataLstSubCat a")))]

Additional imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

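You can probably do the whole thing with an initial GET followed by POST requests, since the page appears to use __doPostBack (.aspx), where the value from the dropdown above is used to return the child items.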
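
For context, `__doPostBack(eventTarget, eventArgument)` in ASP.NET WebForms writes its two arguments into the hidden fields `__EVENTTARGET` and `__EVENTARGUMENT` and then submits the form. A minimal sketch of reproducing that in a POST payload might look like the following; the helper name `postback_payload` and the sample field values are hypothetical, and whether this particular page actually requires the two fields is an assumption the answers here do not confirm.

```python
def postback_payload(form_fields, event_target, event_argument="", **changed):
    """Return a copy of the form fields shaped like a __doPostBack submission.

    form_fields    -- fields scraped from the initial GET (hidden state included)
    event_target   -- name of the control that triggers the postback
    event_argument -- optional argument, usually empty
    changed        -- form values the user would have changed, e.g. the dropdown
    """
    payload = dict(form_fields)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    payload.update(changed)
    return payload


# Hypothetical values; a real run would take form_fields from the GET response.
scraped = {"__VIEWSTATE": "abc123", "txtProduct": "0"}
payload = postback_payload(scraped, "txtProduct", txtProduct="2")
print(payload)
```

The resulting dict could then be passed as `data=` to a `requests.Session().post(...)` call against the same URL.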


from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import time

driver = webdriver.Chrome() #'~/chromedriver.exe')
driver.get('http://www.asianpaintsppg.com/applications/protective_products.aspx')

lst_name = ['Acrylic Coatings','Glass Flake Coatings']

for i in lst_name:
    driver.find_element(By.XPATH, f"//select[@name='txtProduct']/option[text()='{i}']").click()
    items = [item.text for item in  WebDriverWait(driver,5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#dataLstSubCat a")))]
    print(items)
    time.sleep(2)
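
The fixed `time.sleep(2)` could be replaced by polling until the link texts differ from those of the previous iteration. The sketch below shows the polling idea with the standard library only; it is not Selenium's implementation, and `wait_until`, `changed_from`, and the fake reader are hypothetical names used for illustration.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() repeatedly until it is truthy or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


def changed_from(previous, read_texts):
    """Build a condition that passes once read_texts() is non-empty and differs from previous."""
    def condition():
        current = read_texts()
        return current if current and current != previous else False
    return condition


# Stand-in for reading link texts from the browser: stale twice, then updated.
snapshots = iter([["old"], ["old"], ["new"]])
fresh = wait_until(changed_from(["old"], lambda: next(snapshots)), timeout=2.0, poll=0.01)
print(fresh)
```

With Selenium, the same idea can be expressed by passing a callable that takes the driver to `WebDriverWait(driver, 5).until(...)`, with the reader collecting the `.text` of the `#dataLstSubCat a` elements.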

Answer 2 (score: 0)

Use the following code. It prints the link texts for me.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

driver = webdriver.Chrome('~/chromedriver.exe')
driver.get('http://www.asianpaintsppg.com/applications/protective_products.aspx')
lst_name = ['Acrylic Coatings','Glass Flake Coatings']

for i in lst_name:

    driver.find_element(By.XPATH, f"//select[@name='txtProduct']/option[text()='{i}']").click()
    elements=WebDriverWait(driver, 10).until(expected_conditions.presence_of_all_elements_located((By.XPATH, '//table[@id="dataLstSubCat"]//tr//td//a[starts-with(@id,"dataLstSubCat_LnkBtnSubCat_")]')))
    for ele in elements:
        print(ele.text)