Question

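I am trying to retrieve all the relevant URLs from a website, but the page only renders all of them once I scroll down; otherwise I get back just 500 URLs.

I have two key functions. The first collects all the relevant URLs: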

from bs4 import BeautifulSoup
from selenium import webdriver

def scrapeCategory(url):
    url1 = url + "?max=10000"
    html = getHtmlHeadless(url1)
    site = htmlParser(html)
    links = site.findAll('a', {'class':'itemImage', 'data-e2e':'product-listing'}, href=True)
    url_list = []
    for link in links:
        url_list.append("https://www.size.co.uk"+link['href'])
    return url_list

By specifying max=10000, I make sure all listings are on a single page (instead of being spread across several pages).

url1 = url + "?max=10000"
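Appending "?max=10000" by string concatenation only works while the category URL has no query string of its own. As a hedged sketch of a more defensive variant, the helper below sets the parameter with the standard library; the name `with_query_param` is illustrative and not part of the original code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url, key, value):
    """Return `url` with the query parameter `key` set to `value`."""
    parts = urlsplit(url)
    # Keep existing parameters, dropping any previous value for `key`
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != key
    ]
    query.append((key, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

In scrapeCategory this would replace the concatenation, for example `url1 = with_query_param(url, "max", 10000)`.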

The second retrieves the HTML using a headless chromedriver:

def getHtmlHeadless(url):
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
    options = webdriver.ChromeOptions()

    # specify headless mode
    options.add_argument('headless')

    # specify the desired user agent
    options.add_argument(f'user-agent={user_agent}')
    driver = webdriver.Chrome(executable_path='./chromedriver',options=options)

    # Ensure it is a string
    if ( type (url)!= str):
        print("The input must be a string or list of strings")
    driver.get(url)
#     driver.send_keys(Keys.PAGE_DOWN)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    html = driver.page_source
    return html

Following the advice given in other, similar questions, I tried

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

and

driver.send_keys(Keys.PAGE_DOWN)

Neither seems to do the job: with the first I still get at most 500 URLs, and with the second I get an error.

Error:

<AttributeError: 'WebDriver' object has no attribute 'send_keys'>

I suspect that I have not put

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

in the right place, but I don't know where it should go.

Answer 1

'WebDriver' object has no attribute 'send_keys'

send_keys() is a method of the WebElement class, not of the WebDriver class.

Answer 2

I know this is old, but in case anyone still wants to scroll a page with Selenium without using the send_keys method:

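import time

# Seconds to wait after each scroll (assumed value; tune it for the site)
SCROLL_PAUSE_TIME = 0.5

# Page height before scrolling
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll to 90% of the current page height
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.90);")

    # Wait for the page to load more content
    time.sleep(SCROLL_PAUSE_TIME)

    # Stop once the page height no longer grows
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height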
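The same loop can be wrapped in a function so it slots into the question's getHtmlHeadless. This is a sketch under assumptions: the name `scroll_to_bottom`, the `pause` default, and the `max_rounds` safety cap are illustrative additions, and it scrolls to the full page height rather than 90% of it.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until document.body.scrollHeight stops growing.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for lazily loaded items to be appended to the page
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded, so the bottom has been reached
            return round_no
        last_height = new_height
    # Safety cap reached; the page may still have more content
    return max_rounds
```

In getHtmlHeadless the call would go between driver.get(url) and driver.page_source, so that the HTML is read only after the extra items have loaded.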

Scrolling down a web page with Selenium chromedriver
