使用硒

时间:2018-11-19 11:26:00

标签: python selenium selenium-webdriver web-scraping selenium-chromedriver

我只想抓取黑框中包含的必需信息,并删除/删除/排除红框中包含的信息enter image description here

我这样做是因为两个框中都存在类名“ entry”和“ partial entry”。只有第一个“部分条目”包含我需要的信息,因此我打算删除/删除/排除类名“ mgrRspnInLine”。

我的代码是:

while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        try:
            element = item.find_element_by_class_name('mgrRspnInline')
            driver.execute_script("""var element = document.getElementsByClassName("mgrRspnInline")[0];element.parentNode.removeChild(element);""", element)
            WebDriverWait(driver, 50).until(EC.presence_of_element_located((By.XPATH,'.//*[contains(@class,"taLnk ulBlueLinks")]')))
            element = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"taLnk ulBlueLinks")]')))
            element.click()
            time.sleep(2)
            rating = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
            for rate in rating:
                rate = rate.get_attribute("class")
                rate = str(rate)
                rate = rate[-2:]
                score_list.append(rate)
            time.sleep(2)
            stay = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
            for stayed in stay:
                stayed = stayed.text
                stayed = stayed.split(', ')
                stayed.append(stayed[0])
                travel_type.append(stayed[1])
            WebDriverWait(driver, 50).until(EC.presence_of_element_located((By.XPATH,'.//*[contains(@class,"noQuotes")]')))
            summary = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
            for comment in summary:
                comment = comment.text
                comments.append(comment)
            WebDriverWait(driver, 50).until(EC.presence_of_element_located((By.XPATH,'.//*[contains(@class,"ratingDate")]')))
            rating_date = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
            for date in rating_date:
                date = date.get_attribute("title")
                date = str(date)
                review_date.append(date)
            WebDriverWait(driver, 50).until(EC.presence_of_element_located((By.XPATH,'.//*[contains(@class,"partial_entry")]')))
            review = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
            for comment in review:
                comment = comment.text
                print(comment)
                reviews.append(comment)
        except (NoSuchElementException) as e:
            continue
    try:
        element = WebDriverWait(driver, 100).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"nav next taLnk ui_button primary")]')))
        element.click()
        time.sleep(2)
    except (ElementClickInterceptedException,NoSuchElementException) as e:
        print(e)
        break

基本上,在“评论容器”中,我首先搜索了类名“ mgrRspnInLine”,然后尝试使用execute_script删除它。

但是不幸的是,输出仍然显示“ mgrRspnInLine”中包含的内容。

4 个答案:

答案 0 :(得分:2)

如果要避免通过XPath匹配第二个元素,则可以按如下所示修改XPath:

.//*[contains(@class,"partial_entry") and not(ancestor::*[@class="mgrRspnInLine"])]

仅当元素祖先没有类名"partial_entry"时,此元素才会与类名"mgrRspnInLine"匹配

答案 1 :(得分:0)

如果要第一次出现,可以使用css类选择器代替:

.partial_entry

并使用find_element_by_css_selector进行检索:

find_element_by_css_selector(".partial_entry")

答案 2 :(得分:0)

您可以使用以下方法删除所有.mgrRspnInLine元素:

driver.execute_script("[...document.querySelectorAll('.mgrRspnInLine')].map(el => el.parentNode.removeChild(el))")

答案 3 :(得分:0)

拼接Andersson的评论以及QHarr和pguardiario提供的两个答案。我终于解决了这个问题。

关键是针对容器内的一个容器,所有信息都包含在类名“ ui_column is-9”中,该类名包含在类名“ review-container”中,因此解决了Andersson对多个{{ 1}}。

在嵌套循环中,我使用了pguardianrio的建议来删除现有的多个mgrRspnInLine,然后在.partial_entry上添加QHarr的答案

mgrRspnInLine