Question

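I'm trying to get all the hotels, but even though I've run the scroll-down script, my page_source only contains the HTML for 11 hotels, i.e. what was loaded initially.

How can I get the full page source after scrolling down, so that it includes all the hotels?

If driver.execute_script is loading the whole page, how do I store the whole page's source in a variable?

PS: This is for educational purposes only.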

from selenium import webdriver
import re
import pandas as pd
import time
chrome_path = r"C:\Users\ajite\Desktop\web scraping\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get('https://www.makemytrip.com/mmthtl/site/hotels/search?checkin=02252018&checkout=02262018&roomStayQualifier=1e0e&city=GOI&searchText=Goa,%20India&country=IN')

driver.implicitly_wait(3)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)

two_hotels = driver.find_elements_by_xpath('//*[@id="hotel_card_list"]/div')

Answer 1

Your scroll is not being executed. Instead of:

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

you should try:

for i in range(0,25): # here you will need to tune to see exactly how many scrolls you need
  driver.execute_script('window.scrollBy(0, 400)')
  time.sleep(1)

The code I tried:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get("https://www.makemytrip.com/mmthtl/site/hotels/search?checkin=02252018&checkout=02262018&roomStayQualifier=1e0e&city=GOI&searchText=Goa,%20India&country=IN")
driver.implicitly_wait(3)

# Scroll down in small steps rather than jumping straight to the bottom
for i in range(0, 25):  # here you will need to tune to see exactly how many scrolls you need
    driver.execute_script('window.scrollBy(0, 400)')
    time.sleep(1)

time.sleep(10)  # more time so the cards will load

# find_elements(By.XPATH, ...) is used here because find_elements_by_xpath
# was removed in later Selenium 4 releases
two_hotels = driver.find_elements(By.XPATH, '//*[@id="hotel_card_list"]/div')

two_hotels now contains more elements.


With 25 iterations of the loop I got 42 hotels; I think you will need to tune these values a bit to get everything you need.
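Rather than hand-tuning the number of scrolls, one option is to keep scrolling until the number of hotel cards stops growing, and only then read driver.page_source. The following is a minimal sketch of that idea, not a tested recipe for this particular site. The helper name scroll_until_stable and its parameters (step, pause, max_scrolls, patience) are hypothetical names chosen for illustration, and the default values are guesses that would likely need adjusting.

```python
import time


def scroll_until_stable(driver, xpath, step=400, pause=1.0,
                        max_scrolls=100, patience=5):
    """Scroll in small steps until the element count stops growing.

    `patience` is the number of consecutive scrolls without any new
    matching element that we tolerate before stopping (an assumption;
    tune it for the page). Returns the final element count.
    """
    last_count = 0
    stalled = 0
    for _ in range(max_scrolls):
        # Scroll down by `step` pixels; the value is passed as a script argument
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        time.sleep(pause)  # give newly requested cards time to render
        # "xpath" is the string value of selenium's By.XPATH
        count = len(driver.find_elements("xpath", xpath))
        if count > last_count:
            last_count = count
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return last_count


# Usage with a real browser (assumes selenium and a Chrome driver are installed):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get(url)
#   n = scroll_until_stable(driver, '//*[@id="hotel_card_list"]/div')
#   html = driver.page_source  # serialized DOM at this moment, after scrolling
```

driver.page_source returns the DOM as it is when you read it, so reading it after the loop finishes is what stores the scrolled page in a variable. The max_scrolls cap is there so the loop still ends on a page that keeps loading content indefinitely.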

Getting the HTML source of a scrolling web page using Selenium Python web scraping
