Question

https://www.forrent.com/apartment-community-profile/1012635

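I am trying to parse a web page, for example the one above. Selenium returns part of this page, but not all of it. For example, the text "Professional Management: B& Staff" appears on the web page, but it is not in the `content` variable returned by the script. Any idea why this happens and how to fix it?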

driver = webdriver.Firefox(executable_path='/home/yliu/repos/funnel_objects/listing_sites/geckodriver')
try:
    driver.set_page_load_timeout(20)
    driver.get(url)

    #WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "contactHeading")))
    WebDriverWait(driver, 40)
    html = driver.page_source
    content = BeautifulSoup(html, "lxml")
    driver.quit()
    return content
except TimeoutException:
    print('time out from contact')
    return None

Answer 1

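That content is a lazy-loaded component. It is rendered only after you scroll to it, so you need a script that scrolls down to the bottom of the page. See the code below.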

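import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

SCROLL_PAUSE_TIME = 0.5  # seconds to wait after each scroll step
SCROLL_LENGTH = 200      # pixels scrolled per step


def get_content(url):  # function name is illustrative; the original snippet omits its def line
    # executable_path is the Selenium 3 style; Selenium 4 takes the path through a Service object.
    driver = webdriver.Firefox(executable_path='/home/yliu/repos/funnel_objects/listing_sites/geckodriver')
    try:
        driver.set_page_load_timeout(20)
        driver.get(url)

        #WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "contactHeading")))
        #WebDriverWait(driver, 40)
        # Scroll down step by step so the lazy-loaded sections get rendered.
        page_height = int(driver.execute_script("return document.body.scrollHeight"))
        scroll_position = 0
        while scroll_position < page_height:
            scroll_position += SCROLL_LENGTH
            driver.execute_script("window.scrollTo(0, " + str(scroll_position) + ");")
            time.sleep(SCROLL_PAUSE_TIME)

        html = driver.page_source
        return BeautifulSoup(html, "lxml")
    except TimeoutException:
        print('time out from contact')
        return None
    finally:
        # Close the browser on both the success path and the timeout path.
        driver.quit()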
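
One possible refinement, offered as a sketch rather than as part of the original answer: the loop above reads `document.body.scrollHeight` once, before scrolling, so it can stop early if lazy-loaded sections make the page taller. The variant below re-reads the height on every pass. The helper name `scroll_until_stable` and the `max_steps` safety cap are assumptions made for illustration; `driver` is any object that exposes Selenium's `execute_script` method.

```python
import time


def scroll_until_stable(driver, step=200, pause=0.5, max_steps=500):
    """Scroll down in fixed steps until the scroll position reaches the page height.

    Hypothetical helper: returns the final scroll position in pixels.
    """
    position = 0
    for _ in range(max_steps):  # cap the loop in case the page keeps growing forever
        # Re-read the height each pass, since lazy-loaded content can extend the page.
        height = int(driver.execute_script("return document.body.scrollHeight"))
        if position >= height:
            break
        position += step
        # Pass the offset as a script argument instead of building the string by hand.
        driver.execute_script("window.scrollTo(0, arguments[0]);", position)
        time.sleep(pause)
    return position
```

In the answer's function this would be called as `scroll_until_stable(driver)` right before reading `driver.page_source`. If a specific element is needed, an explicit wait such as the commented-out `WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "contactHeading")))` can follow the scroll. Note that `WebDriverWait(driver, 40)` on its own only constructs the wait object; nothing is waited for until `.until(...)` is called.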

Selenium WebDriver cannot get some of the content
