Getting the HTML source of a scrolling web page with Selenium (Python web scraping)

Time: 2018-02-13 02:32:46

Tags: javascript python selenium web-scraping

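I am trying to get all the hotels, but even after executing the scroll-down script, my page_source only contains the HTML for 11 hotels, i.e. the ones that were loaded initially.

How can I get the source for the whole page after scrolling down so that all the hotels are included?

If the driver.execute_script call does load the entire page, how do I store the page source of the entire page in my variable?

PS: This is for educational purposes only.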

from selenium import webdriver
import re
import pandas as pd
import time
chrome_path = r"C:\Users\ajite\Desktop\web scraping\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get('https://www.makemytrip.com/mmthtl/site/hotels/search?checkin=02252018&checkout=02262018&roomStayQualifier=1e0e&city=GOI&searchText=Goa,%20India&country=IN')

driver.implicitly_wait(3)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)

two_hotels = driver.find_elements_by_xpath('//*[@id="hotel_card_list"]/div')

1 Answer:

Answer 0: (score: 1)

Your scroll is not taking effect. Instead of:

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 

you should try:

for i in range(0,25): # here you will need to tune to see exactly how many scrolls you need
  driver.execute_script('window.scrollBy(0, 400)')
  time.sleep(1)

The code I tried:

import selenium
import time
from selenium import webdriver
driver = webdriver.Chrome()

driver.get("https://www.makemytrip.com/mmthtl/site/hotels/search?checkin=02252018&checkout=02262018&roomStayQualifier=1e0e&city=GOI&searchText=Goa,%20India&country=IN")
driver.implicitly_wait(3)

for i in range(0,25): # here you will need to tune to see exactly how many scrolls you need
  driver.execute_script('window.scrollBy(0, 400)')
  time.sleep(1)

time.sleep(10) #more time so the cards will load

two_hotels = driver.find_elements_by_xpath('//*[@id="hotel_card_list"]/div')

two_hotels now contains more entries.


With range(0, 25) I got 42 hotel entries. You will probably need to tune these values to load everything you need.
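
Rather than tuning the scroll count by hand, one option is to keep scrolling in small steps until the number of matched cards stops growing. The following is only a minimal sketch of that idea, not a tested recipe for this site. The helper name scroll_until_stable and its parameters (step, pause, patience, max_scrolls) are hypothetical and chosen for illustration. It assumes the driver exposes execute_script and find_elements("xpath", ...), as Selenium WebDriver does.

```python
import time


def scroll_until_stable(driver, card_xpath, step=400, pause=1.0,
                        patience=5, max_scrolls=200):
    """Scroll down in small steps until the card count stops growing.

    Hypothetical helper: stops after `patience` consecutive scrolls that
    load no new elements, or after `max_scrolls` scrolls in total.
    Returns the last list of matched elements.
    """
    stale_rounds = 0
    cards = driver.find_elements("xpath", card_xpath)
    for _ in range(max_scrolls):
        # Scroll by a fixed number of pixels, as in the answer above.
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        time.sleep(pause)  # give lazily loaded cards time to appear
        current = driver.find_elements("xpath", card_xpath)
        if len(current) > len(cards):
            stale_rounds = 0      # new cards appeared, keep going
        else:
            stale_rounds += 1     # nothing new on this scroll
        cards = current
        if stale_rounds >= patience:
            break
    return cards


# Usage with the driver from the answer above (assumed to be already open):
# hotels = scroll_until_stable(driver, '//*[@id="hotel_card_list"]/div')
# html = driver.page_source  # read after scrolling, so it reflects the loaded cards
```

If the page loads slowly, patience or pause may need to be larger, otherwise the loop can stop before the last cards arrive.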