My script fails to parse items from a complex web page

Date: 2018-08-24 10:06:41

Tags: python python-3.x selenium selenium-webdriver web-scraping

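I've written a script in Python with Selenium to scrape the different reviewers connected to each item name from a web page. Some items show their reviewers once the see more button is clicked, while others have no reviewers at all.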

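I tried to write the script so that it gets all the item links from the landing page, visits each link, clicks the review tab, then clicks the see more button, and finally collects the reviewers, repeating the same steps until no items are left.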

The main problem here is that when the script clicks the see more button, a new tab containing the reviewers opens.

Link to the landing page

Link to one of such item containing reviews

Link to the page containing full reviews

This is my attempt so far:

from urllib.parse import urljoin
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://eatstreet.com/madison-wi/restaurants"

def get_information(driver, link):
    driver.get(link)
    #collecting all the links connected to item names
    itemlinks = [
        urljoin(url, item.get_attribute("href"))
        for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.restaurant-header")))
    ]
    for itemlink in itemlinks:
        driver.get(itemlink)
        #check whether there is any review
        revitem = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "label[for='reviews']")))
        if revitem and (revitem.text != "Reviews (0)"):
            current = driver.current_window_handle
            wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "label[for='reviews']"))).click()
            wait.until(EC.visibility_of_element_located((By.LINK_TEXT, 'See More Reviews'))).click()
            wait.until(EC.new_window_is_opened)
            driver.switch_to.window([window for window in driver.window_handles if window != current][0])
            while True:
                for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'ul.reviews div.review .review-sidebar #dropdown_user-name'))):
                    print(item.text)
                try:
                    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".pagination-block a.next"))).click()
                    wait.until(EC.staleness_of(item))
                except Exception:
                    break
            driver.switch_to.default_content()

if __name__ == '__main__':
    options = Options()
    options.add_argument("--disable-notifications")
    driver = webdriver.Chrome(chrome_options=options)
    wait = WebDriverWait(driver, 10)
    try:
        get_information(driver, url)
    finally:
        driver.quit()

My script above can collect the names of the reviewers from the first available item that has reviews, but when it is supposed to move on to the next item and collect its reviewers, it throws a timeout exception. This probably happens because the newly opened tab is never deselected before the script tries to repeat the operation.

The image below shows how the see more button appears:

[Image: the see more button]

1 answer:

Answer 0 (score: 2)

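If you need to close the new window and return to the initial window, try replacing

driver.switch_to.default_content()

with

driver.close()
driver.switch_to.window(current)

Note that switch_to.default_content() only leaves a frame inside the current window; it does not change which window the driver is focused on.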
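
The close-and-switch-back step can also be wrapped in a small helper so it runs even when scraping the new tab raises an exception. This is only a sketch under assumptions: popup_window and open_popup are hypothetical names, and the helper relies on nothing but the driver's current_window_handle, window_handles, switch_to.window() and close(). In a real script, WebDriverWait together with EC.new_window_is_opened(handles), called with the list of handles taken before the click, could probably replace the polling loop.

```python
import time
from contextlib import contextmanager


@contextmanager
def popup_window(driver, open_popup, timeout=10.0, poll=0.2):
    """Switch to the window opened by open_popup(), then close it and go back.

    Hypothetical helper: `driver` only needs to behave like a Selenium
    WebDriver (current_window_handle, window_handles, switch_to.window, close).
    """
    original = driver.current_window_handle
    known = set(driver.window_handles)
    open_popup()  # e.g. click the "See More Reviews" link

    # Poll until a handle that did not exist before the click shows up.
    deadline = time.monotonic() + timeout
    while True:
        new_handles = [h for h in driver.window_handles if h not in known]
        if new_handles:
            break
        if time.monotonic() > deadline:
            raise TimeoutError("no new window was opened")
        time.sleep(poll)

    driver.switch_to.window(new_handles[0])
    try:
        yield new_handles[0]
    finally:
        # Close the popup and return to the initial window,
        # even if the scraping code inside the with-block failed.
        driver.close()
        driver.switch_to.window(original)
```

Inside the loop over item links, the click on See More Reviews would be passed as open_popup, and the reviewer scraping would go in the body of the with statement, so every iteration starts again from the initial window.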