单击“更多”按钮后如何废弃网站数据

时间:2017-06-19 16:37:05

标签: python selenium web-scraping beautifulsoup

我正在尝试使用BS4 + selen来学习网页报废。网站链接为tripadvisor

评论文字有一个更多的SPAN,点击一下,使用AJAX将更多文本加载到同一个div中。

但是我的代码在“按更多”按钮单击“selenium”之前输出了评论文本。

如何使用selenium点击“更多”按钮进行报废

    from selenium import webdriver
from bs4 import BeautifulSoup


def openUrl(link):
    driver = webdriver.Firefox()
    driver.get(link)

    elem1 = driver.find_element_by_xpath("//span[@class='taLnk ulBlueLinks']")
    elem1.click()
    html_source = driver.page_source
    driver.quit()

    soup = BeautifulSoup(html_source, 'lxml')
    foundDiv = soup.findAll("div", {"class": "review-container"})
    for reviewContainer in foundDiv:

        ratingText = reviewContainer.select_one(".partial_entry").text
        print(ratingText)

openUrl("https://www.tripadvisor.in/Hotel_Review-g1010231-d1065009-Reviews-Radisson_Blu_Resort_Spa_Alibaug-Alibaug_Raigad_District_Maharashtra.html")

但BS4在不等待更多按钮点击的情况下报废数据。

请帮忙

enter image description here

1 个答案:

答案 0 :(得分:0)

请参阅下面的WebDriverWait示例。

driver.get('https://www.tripadvisor.in/Hotel_Review-g1010231-d1065009-Reviews-Radisson_Blu_Resort_Spa_Alibaug-Alibaug_Raigad_District_Maharashtra.html')
moreButton = driver.find_element_by_css_selector("span.taLnk.ulBlueLinks")
moreButton.click()

wait = WebDriverWait(driver, 10)
element = wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, "div[data-reviewid='493434022'] div.loadingShade")))

html_source = driver.page_source
print(html_source)