Question

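I am trying to get all the elements on a page, but the page uses infinite scrolling. I tried scrolling down the page and then reading the attributes, but it does not pick up all of them. For some reason I only get about half.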

import time

from lxml import etree
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.amazon.com/gp/pdp/profile/A2A46BUQRGSAB0/ref=cm_cr_dp_pdp")
lastHeight = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom, then give the page time to load more items.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    print(newHeight, lastHeight)
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
tree = etree.HTML(driver.page_source)
product = tree.xpath('//span[@class="a-size-base product-title pr-multiline-ellipses-container"]//text()')[::3]
print(len(product))

Answer 1

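Looking at the Selenium Python bindings documentation, you could try using implicit or explicit waits. The answer to the SO question "Selenium random timeout exceptions without any message" may help with implementing an explicit wait.

For an implicit wait, you could try something like this (untested):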

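from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def reached_bottom(driver):
    # Returns the "no-more" element (truthy) if present, otherwise False.
    try:
        return driver.find_element(By.CLASS_NAME, "no-more")
    except NoSuchElementException:
        return False

driver = webdriver.Firefox()
# Each element lookup now retries for up to 10 seconds before failing.
driver.implicitly_wait(10)
driver.get("http://www.amazon.com/gp/pdp/profile/A2A46BUQRGSAB0/ref=cm_cr_dp_pdp")

while not reached_bottom(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

product = ...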

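I use the no-more class, which shows up at the end, as the stop condition, assuming it is added to the DOM once the end is reached. But, again, I have not tested it.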

Answer 2

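You need to wait for the scroll to take effect. Otherwise, you will fetch the page source before the update has finished.

A simple but imperfect workaround is to call time.sleep with a long enough delay: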

import time

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(1)  # <---
newHeight = driver.execute_script("return document.body.scrollHeight")
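
One reason a fixed sleep is imperfect is that a single slow load makes the height look unchanged, so the loop stops early. A hedged sketch of a more tolerant loop follows: it only stops after the height has stayed the same for several checks in a row. The function name scroll_until_stable and the parameters pause and max_stable_checks are assumptions for illustration; the only driver call used is execute_script.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_stable_checks=3):
    """Scroll down until the height is unchanged for several checks in a row.

    Hypothetical helper: `pause` is the sleep after each scroll and
    `max_stable_checks` is how many unchanged readings end the loop.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    stable_checks = 0
    while stable_checks < max_stable_checks:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new yet; this may be the end or just a slow load.
            stable_checks += 1
        else:
            # The page grew, so reset the counter and keep scrolling.
            stable_checks = 0
            last_height = new_height
    return last_height
```

With max_stable_checks=1 this behaves like the single comparison in the question. Larger values trade a longer wait at the end of the page for fewer premature stops.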
