Scraping hidden product details from a web page with Selenium

Time: 2017-03-26 09:25:56

Tags: python selenium web-scraping

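Sorry, I'm a Selenium noob and have done a lot of reading, but I still can't get the product price (£0.55) from this page: https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628. When I parse the HTML with bs4, the product details are not visible. With Selenium I can get the whole page as a string and can see the price in there (using the code below). I should be able to extract the price from that string somehow, but I would prefer a less hacky solution.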

from selenium import webdriver

browser = webdriver.Firefox(executable_path=r'C:\Users\Paul\geckodriver.exe')
browser.get('https://groceries.asda.com/product/tinned-tomatoes/asda-smart-price-chopped-tomatoes-in-tomato-juice/19560')
# The full page markup as a single string; the price is visible in here
content = browser.page_source

If I run something like this:

elem = browser.find_element_by_id("bodyContainerTemplate")
print(elem)

it only returns: <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="df23fae6-e99c-403c-a992-a1adf1cb8010", element="6d9aac0b-2e98-4bb5-b8af-fcbe443af906")>

The price is the text associated with this element: <p class="prod-price">, but I can't seem to get it to work. How do I get this text (the product price)?

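2 answers:

Answer 0: (score: 3)

elem is of type WebElement, so printing it shows the object rather than its contents. If you need to extract the text value of a web element, you can use the following code: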

elem = driver.find_element_by_class_name("prod-price-inner")
print(elem.text)
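
The product details seem to appear only once the page's scripts have run (presumably why bs4 alone could not see them), so reading the element straight after `get()` may be too early. Below is a minimal sketch that adds an explicit wait. It assumes Selenium 4 and a working Firefox driver setup, and it uses the locator style (`By.CLASS_NAME`) because recent Selenium 4 releases no longer provide the `find_element_by_*` helpers. The names `fetch_price_text` and `parse_price` are hypothetical helpers introduced for illustration, and the `prod-price-inner` class is taken from the snippet above.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Turn a price string such as '£0.55' into Decimal('0.55')."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return Decimal(match.group())


def fetch_price_text(url, timeout=10):
    """Load the page in Firefox and return the text of the price element."""
    # Imported here so parse_price stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: a Firefox driver is available
    try:
        driver.get(url)
        # Block for up to `timeout` seconds until the element is visible.
        elem = WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.CLASS_NAME, "prod-price-inner"))
        )
        return elem.text
    finally:
        # Close the browser even if the wait times out.
        driver.quit()
```

A call such as `parse_price(fetch_price_text(url))` would then give the price as a `Decimal`, provided the class name still matches the live page.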

Answer 1: (score: 3)

Try this solution, which works with Selenium and BeautifulSoup and prints the price text:

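from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628'

# PhantomJS support was removed in Selenium 4; with a current release,
# use a headless Firefox or Chrome driver instead.
driver = webdriver.PhantomJS()
driver.get(url)

# Read the markup after the browser has loaded the page
data = driver.page_source

soup = BeautifulSoup(data, 'html.parser')

# The price sits in <span class="prod-price-inner">
ele = soup.find('span', {'class': 'prod-price-inner'})

print(ele.text)

driver.quit()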
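
The parsing step can be tried without a browser. The sketch below runs the same lookup on two made-up HTML strings, one standing in for the raw response and one for the rendered `page_source`. Both strings, the class `PriceFinder`, and the helper `find_price` are assumptions for illustration, not markup or code from the real site. The standard-library `html.parser` is swapped in for BeautifulSoup here so that nothing needs installing.

```python
from html.parser import HTMLParser


class PriceFinder(HTMLParser):
    """Collects the text of the first <span class="prod-price-inner">."""

    def __init__(self):
        super().__init__()
        self._open_spans = 0  # span nesting depth inside the target element
        self._chunks = []
        self.price = None

    def handle_starttag(self, tag, attrs):
        if tag != "span" or self.price is not None:
            return
        if self._open_spans:
            self._open_spans += 1  # a span nested inside the target
        elif "prod-price-inner" in (dict(attrs).get("class") or "").split():
            self._open_spans = 1  # entered the target span

    def handle_endtag(self, tag):
        if tag == "span" and self._open_spans:
            self._open_spans -= 1
            if self._open_spans == 0:
                self.price = "".join(self._chunks).strip()

    def handle_data(self, data):
        if self._open_spans:
            self._chunks.append(data)


def find_price(html):
    """Return the price text, or None when the span is absent."""
    finder = PriceFinder()
    finder.feed(html)
    finder.close()
    return finder.price


# Hypothetical stand-ins: the page before and after its scripts have run.
RAW_HTML = '<div id="bodyContainerTemplate"></div>'
RENDERED_HTML = (
    '<div id="bodyContainerTemplate">'
    '<p class="prod-price"><span class="prod-price-inner">£0.55</span></p>'
    '</div>'
)
```

With these inputs, `find_price(RAW_HTML)` returns `None` and `find_price(RENDERED_HTML)` returns the price string, which mirrors the difference between parsing the raw response and parsing the page source that Selenium hands back.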