Question

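I am trying to scrape some data from a flight search page.

The page works like this:

You fill in a form and click the search button; that part is fine. When you click the button, you are redirected to a results page, and this is where the problem starts. That page keeps adding results for about a minute, which is not a big deal in itself; the problem is getting all of those results. In a real browser you have to scroll down the page for the results to appear, so I tried scrolling down with Selenium. It reaches the bottom of the page so quickly, or perhaps jumps there instead of scrolling, that the page does not load any new results.

When you scroll down slowly, the page loads more results, but if you scroll down very quickly, it stops loading.

I am not sure whether my code helps to explain the problem, but I am attaching it anyway.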

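import re

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# mLib (used below) is the asker's own helper module; it is not shown here.

SEARCH_STRING = """URL"""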

class spider():

    def __init__(self):
        self.driver = webdriver.Firefox()

    @staticmethod
    def prepare_get(dep_airport,arr_airport,dep_date,arr_date):
        string = SEARCH_STRING%(dep_airport,arr_airport,arr_airport,dep_airport,dep_date,arr_date)
        return string


    def find_flights_html(self,dep_airport, arr_airport, dep_date, arr_date):
        if isinstance(dep_airport, list):
            airports_string = str(r'%20').join(dep_airport)
            dep_airport = airports_string

        wait = WebDriverWait(self.driver, 60) # wait for results
        self.driver.get(spider.prepare_get(dep_airport, arr_airport, dep_date, arr_date))
        wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
        wait.until(EC.invisibility_of_element_located((By.XPATH, u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))
        self.driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")

        self.driver.find_element_by_xpath('//body').send_keys(Keys.CONTROL+Keys.END)
        return self.driver.page_source

    @staticmethod 
    def get_info_from_borderbox(div):
        arrival = div.find('div',class_='departure').text
        price = div.find('div',class_='pricebox').find('div',class_=re.compile('price'))
        departure = div.find_all('div',class_='departure')[1].contents
        date_departure = departure[1].text 
        airport_departure = departure[5].text
        arrival = div.find_all('div', class_= 'arrival')[0].contents
        date_arrival = arrival[1].text
        airport_arrival = arrival[3].text[1:]
        print 'DEPARTURE: ' 
        print date_departure,airport_departure
        print 'ARRIVAL: '
        print date_arrival,airport_arrival

    @staticmethod
    def get_flights_from_result_page(html):

        def match_tag(tag, classes):
            return (tag.name == 'div'
                    and 'class' in tag.attrs
                    and all([c in tag['class'] for c in classes]))

        soup = mLib.getSoup_html(html)
        divs = soup.find_all(lambda t: match_tag(t, ['borderbox', 'flightbox', 'p2']))

        for div in divs:
            spider.get_info_from_borderbox(div)

        print len(divs)


spider_inst = spider() 

print spider.get_flights_from_result_page(spider_inst.find_flights_html(['BTS','BRU','PAR'], 'MAD', '2015-07-15', '2015-08-15'))

So I think the main problem is that it scrolls too fast to trigger the loading of new results.

Do you have any idea how to make it work?

Answer 1

I had the same problem; I needed this to scrape a social media site.

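import time

# driver is an existing WebDriver instance
y = 1000
for timer in range(0, 50):
    # jump another 1000 px down, then pause so new content can load
    driver.execute_script("window.scrollTo(0, " + str(y) + ")")
    y += 1000
    time.sleep(1)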

The one-second sleep after every 1000 px gives the page time to load.
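The fixed count of 50 iterations may be too many or too few for a given page. As a rough sketch, assuming the page grows its document.body.scrollHeight whenever it loads more content, the loop can stop as soon as the bottom stops moving. The function name scroll_in_steps and its parameters are made up for illustration; driver is any object with Selenium's execute_script method.

```python
import time


def scroll_in_steps(driver, step=1000, pause=1.0, max_steps=500):
    """Scroll down `step` pixels at a time until the page stops growing.

    Returns the last scroll offset that was requested.
    """
    y = 0
    for _ in range(max_steps):
        height = driver.execute_script("return document.body.scrollHeight")
        if y >= height:
            # We already asked for the bottom and the page did not grow.
            break
        y += step
        driver.execute_script("window.scrollTo(0, arguments[0]);", y)
        time.sleep(pause)  # give lazily loaded content time to arrive
    return y
```

max_steps is only a safety cap for pages that never stop growing.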

Answer 2

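Here is a different approach that worked for me: scroll the last search result into view, then wait for more elements to load before scrolling again: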

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC


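class wait_for_more_than_n_elements(object):
    """Expected condition: more than `count` elements match `locator`."""

    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        try:
            # locator is a (By, value) tuple, e.g. (By.CSS_SELECTOR, 'div.flightbox')
            count = len(driver.find_elements(*self.locator))
            # strictly greater: the caller passes the number of results already seen
            return count > self.count
        except StaleElementReferenceException:
            return False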


driver = webdriver.Firefox()

dep_airport = ['BTS', 'BRU', 'PAR']
arr_airport = 'MAD'
dep_date = '2015-07-15'
arr_date = '2015-08-15'

airports_string = str(r'%20').join(dep_airport)
dep_airport = airports_string

url = "https://www.pelikan.sk/sk/flights/list?dfc=C%s&dtc=C%s&rfc=C%s&rtc=C%s&dd=%s&rd=%s&px=1000&ns=0&prc=&rng=1&rbd=0&ct=0" % (dep_airport, arr_airport, arr_airport, dep_airport, dep_date, arr_date)
driver.maximize_window()
driver.get(url)

wait = WebDriverWait(driver, 60)
wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
wait.until(EC.invisibility_of_element_located((By.XPATH,
                                               u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))

# TODO: end the loop cleanly; as written, wait.until raises TimeoutException
# once no new results arrive within 60 seconds
while True:
    results = driver.find_elements(By.CSS_SELECTOR, "div.flightbox")
    print("Results count: %d" % len(results))

    # scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView();", results[-1])

    # wait for more results to load
    wait.until(wait_for_more_than_n_elements((By.CSS_SELECTOR, 'div.flightbox'), len(results)))

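Notes:

- You need to decide when to stop the loop, for example at a particular len(results) value.
- wait_for_more_than_n_elements is a custom Expected Condition that helps determine when the next batch has loaded, so that we can scroll again.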
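One possible stopping rule is to give up once the result count has not grown for some time. The following is only a sketch of that idea, not part of the answer's code: scroll_until_stable and its timeout and poll parameters are hypothetical names, and the string "css selector" is the value of Selenium's By.CSS_SELECTOR, used here so that the function needs only a driver object passed in.

```python
import time


def scroll_until_stable(driver, css_selector, timeout=10.0, poll=0.5):
    """Scroll the last match into view until no new matches appear.

    Returns the final number of matching elements.
    """
    while True:
        results = driver.find_elements("css selector", css_selector)
        if not results:
            return 0  # nothing to scroll to
        count = len(results)
        driver.execute_script("arguments[0].scrollIntoView();", results[-1])

        # Poll until more results show up or the timeout expires.
        deadline = time.monotonic() + timeout
        while len(driver.find_elements("css selector", css_selector)) <= count:
            if time.monotonic() >= deadline:
                return count  # nothing new arrived in time: assume we are done
            time.sleep(poll)
```

With a real browser this would be called as scroll_until_stable(driver, "div.flightbox"); the timeout should be longer than the slowest batch the site is expected to deliver.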

Answer 3

After some experimenting, I finally found a good solution:

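def __scroll_down_page(self, speed=8):
    # Method of a scraper class that keeps its WebDriver in self.__driver.
    current_scroll_position, new_height = 0, 1
    while current_scroll_position <= new_height:
        # move down by `speed` pixels per iteration
        current_scroll_position += speed
        self.__driver.execute_script("window.scrollTo(0, {});".format(current_scroll_position))
        # re-read the height so that newly loaded content extends the loop
        new_height = self.__driver.execute_script("return document.body.scrollHeight")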

Answer 4

You can scroll smoothly with Selenium like this:

total_height = int(driver.execute_script("return document.body.scrollHeight"))

for i in range(1, total_height, 5):
    driver.execute_script("window.scrollTo(0, {});".format(i))

Answer 5

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://en.wikipedia.org")
height = driver.execute_script("return document.body.scrollHeight")
for scrol in range(100, height, 100):
    driver.execute_script(f"window.scrollTo(0,{scrol})")
    time.sleep(0.1)

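It worked for me. If you want to scroll the page to the end so that all page elements are displayed, this may be useful to you. If you want to scroll faster, change the scroll step from 100 to 200.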
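To see what changing the step does, here is a small back-of-the-envelope helper. It only counts the time spent in time.sleep and ignores the cost of each execute_script call, so treat it as a lower bound; the name estimated_scroll_seconds is made up for this illustration.

```python
def estimated_scroll_seconds(height, step=100, pause=0.1):
    """Estimate the time spent sleeping in a range(100, height, step) scroll loop."""
    iterations = len(range(100, height, step))
    return iterations * pause
```

For a page 10,000 px tall this gives about 9.9 seconds with a step of 100 and about 5 seconds with a step of 200.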

Answer 6

In Python Selenium, get the Y position of the element, then scroll down to it slowly.

y = driver.execute_script("return document.querySelector('YOUR-CSS-SELECTOR').getBoundingClientRect()['y']")
for x in range(0, int(y), 100):
    driver.execute_script("window.scrollTo(0, "+str(x)+");")
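Note that getBoundingClientRect() reports coordinates relative to the viewport, so the value above matches the document position only while the page is scrolled to the top. A sketch that adds window.pageYOffset, pauses between steps, and finishes exactly on the target could look like this; step_offsets and scroll_to_element_slowly are hypothetical helper names, and driver is any object with Selenium's execute_script method.

```python
import time


def step_offsets(target, step):
    """Return the scroll offsets 0, step, 2*step, ... ending exactly at target."""
    offsets = list(range(0, int(target), step))
    offsets.append(int(target))
    return offsets


def scroll_to_element_slowly(driver, css_selector, step=100, pause=0.1):
    """Scroll step by step to the first element matching css_selector.

    Returns the element's document Y offset, or None if nothing matched.
    """
    target = driver.execute_script(
        "var el = document.querySelector(arguments[0]);"
        "return el ? el.getBoundingClientRect().top + window.pageYOffset : null;",
        css_selector,
    )
    if target is None:
        return None  # no element matched the selector
    for y in step_offsets(target, step):
        driver.execute_script("window.scrollTo(0, arguments[0]);", y)
        time.sleep(pause)  # let lazily loaded content catch up
    return int(target)
```

Passing the selector through arguments[0] avoids building JavaScript by string concatenation.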

Scrolling down a page with Selenium
