Selenium Python web scrape fails after the first iteration

Time: 2016-10-13 06:02:25

Tags: python selenium web-scraping

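I am iterating through TripAdvisor pages to save both the original (untranslated) reviews and their translations (from Portuguese to English). The scraper first selects Portuguese reviews for display, then translates them into English one by one as usual, saving the translated reviews in com_ and the expanded, untranslated reviews in expanded_comments.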

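The code works for the first page, but from the second page onward it fails to save the translated reviews. Strangely, it only translates the first review on each page and does not even save it.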

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
com_=[]
expanded_comments=[]
date_=[]
driver = webdriver.Chrome("C:\Users\shalini\Downloads\chromedriver_win32\chromedriver.exe")
driver.maximize_window()
from bs4 import BeautifulSoup

def expand_reviews(driver):
    # TRYING TO EXPAND REVIEWS (& CLOSE A POPUP)    
    try:
        driver.find_element_by_class_name("moreLink").click()
    except:
        print "err"
    try:
        driver.find_element_by_class_name("ui_close_x").click()
    except:
        print "err2"
    try:
        driver.find_element_by_class_name("moreLink").click()
    except:
        print "err3"




def save_comments(driver):
    expand_reviews(driver)
    # SELECTING ALL EXPANDED COMMENTS
    #expanded_com_elements=driver.find_elements_by_class_name("entry")
    time.sleep(3)
    #for i in expanded_com_elements:
    #   expanded_comments.append(i.text)
    spi=driver.page_source
    sp=BeautifulSoup(spi)
    for t in sp.findAll("div",{"class":"entry"}):
        if not t.findAll("p",{"class":"partial_entry"}):
            #print t
            expanded_comments.append(t.getText())
    # Saving review date
    for d in sp.findAll("span",{"class":"recommend-titleInline"}) :
        date=d.text
        date_.append(date)  # append the date string, not the date_ list itself


    # SELECTING ALL GOOGLE-TRANSLATOR links
    gt= driver.find_elements(By.CSS_SELECTOR,".googleTranslation>.link")

    # NOW PRINTING TRANSLATED COMMENTS
    for i in gt:
        try:
            driver.execute_script("arguments[0].click()",i)

            #com=driver.find_element_by_class_name("ui_overlay").text
            com= driver.find_element_by_xpath(".//span[@class = 'ui_overlay ui_modal ']//div[@class='entry']")
            com_.append(com.text)
            time.sleep(5)
            driver.find_element_by_class_name("ui_close_x").click().perform()
            time.sleep(5)
        except Exception as e:
            pass

# ITERATING THROUGH ALL 200 tripadvisor webpages and saving comments & translated comments
for i in range(200):
    page=i*10
    url="https://www.tripadvisor.com/Airline_Review-d8729164-Reviews-Cheap-Flights-or"+str(page)+"-TAP-Portugal#REVIEWS"
    driver.get(url)
    wait = WebDriverWait(driver, 10)
    if i==0:
        # SELECTING PORTUGUESE COMMENTS ONLY # Run for one time then iterate over pages
        try:
            langselction = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span.sprite-date_picker-triangle")))
            langselction.click()
            driver.find_element_by_xpath("//div[@class='languageList']//li[normalize-space(.)='Portuguese first']").click()
            time.sleep(5)
        except Exception as e:
            print e

    save_comments(driver)

1 answer:

Answer 0 (score: 1)

There are 3 problems in your code:

  1. Inside save_comments(), at driver.find_element_by_class_name("ui_close_x").click().perform(): the WebElement method click() is not an ActionChains call (it returns None), so you cannot call perform() on it. The resulting AttributeError is silently swallowed by the surrounding except block. The line should be:

         driver.find_element_by_class_name("ui_close_x").click()

  2. Inside save_comments(), at com= driver.find_element_by_xpath(".//span[@class = 'ui_overlay ui_modal ']//div[@class='entry']"), you are looking up an element that is not displayed yet, so you have to add a wait before this line. Your code should look like this:

         wait = WebDriverWait(driver, 10)
         wait.until(EC.element_to_be_clickable((By.XPATH, ".//span[@class = 'ui_overlay ui_modal ']//div[@class='entry']")))
         com= driver.find_element_by_xpath(".//span[@class = 'ui_overlay ui_modal ']//div[@class='entry']")

  3. There are 2 buttons that open each review, one displayed and one hidden, so you have to skip the hidden one:

         if not i.is_displayed():
             continue
         driver.execute_script("arguments[0].click()",i)
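
Putting the three fixes together, the translation loop could be sketched roughly as below. This is a minimal sketch rather than the original author's code: the function name collect_translations and the wait_for_overlay parameter are hypothetical, introduced only so the loop can be exercised without a live browser. The locator string "class name" is the value behind selenium's By.CLASS_NAME.

```python
# Locator strategy string; equal to selenium's By.CLASS_NAME.
CLASS_NAME = "class name"

# XPath of the translated review inside the overlay, copied from the question.
OVERLAY_XPATH = ".//span[@class = 'ui_overlay ui_modal ']//div[@class='entry']"


def collect_translations(driver, links, wait_for_overlay):
    """Click each visible translation link and return the overlay texts.

    wait_for_overlay is a hypothetical zero-argument callable that blocks
    until the overlay entry is ready and returns that element. With a real
    browser it could be:
        lambda: WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, OVERLAY_XPATH)))
    """
    texts = []
    for link in links:
        # Fix 3: each review has a hidden duplicate link; skip it.
        if not link.is_displayed():
            continue
        driver.execute_script("arguments[0].click()", link)
        # Fix 2: wait for the overlay before reading its text.
        entry = wait_for_overlay()
        texts.append(entry.text)
        # Fix 1: click() returns None, so there is no .perform() to call.
        driver.find_element(CLASS_NAME, "ui_close_x").click()
    return texts
```

In save_comments() this would replace the for loop over gt, for example com_.extend(collect_translations(driver, gt, wait_for_overlay)). Unlike the original loop, it does not wrap each iteration in a bare except, so a failing wait or click surfaces as an exception instead of being hidden.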