https://www.forrent.com/apartment-community-profile/1012635
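I am trying to parse a web page, for example the one linked above. Selenium returns part of the page but not all of it. For example, the text "Professionally Managed: B& Staff" appears on the web page, but it is missing from the variable 'content' returned by the script. Any idea why this happens and how to fix it?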
driver = webdriver.Firefox(executable_path='/home/yliu/repos/funnel_objects/listing_sites/geckodriver')
try:
    driver.set_page_load_timeout(20)
    driver.get(url)
    #WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "contactHeading")))
    WebDriverWait(driver, 40)
    html = driver.page_source
    content = BeautifulSoup(html,"lxml")
    driver.quit()
    return content
except TimeoutException:
    print('time out from contact')
    return None
Answer 0 (score: 2)
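That content is a lazy-loaded component. It only appears after you scroll, so you need a script that scrolls down to the bottom of the page. See the code below.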
# Assumes `import time` in addition to the Selenium and BeautifulSoup imports used in the question.
driver = webdriver.Firefox(executable_path='/home/yliu/repos/funnel_objects/listing_sites/geckodriver')
try:
    driver.set_page_load_timeout(20)
    driver.get(url)
    #WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "contactHeading")))
    #WebDriverWait(driver, 40)
    SCROLL_PAUSE_TIME = 0.5  # seconds to wait after each scroll step
    SCROLL_LENGTH = 200      # pixels to scroll per step
    # Page height as measured once, right after the initial load
    page_height = int(driver.execute_script("return document.body.scrollHeight"))
    scrollPosition = 0
    # Scroll down step by step so the lazy-loaded components get triggered
    while scrollPosition < page_height:
        scrollPosition = scrollPosition + SCROLL_LENGTH
        driver.execute_script("window.scrollTo(0, " + str(scrollPosition) + ");")
        time.sleep(SCROLL_PAUSE_TIME)
    # Read the page source only after scrolling has finished
    html = driver.page_source
    content = BeautifulSoup(html,"lxml")
    driver.quit()
    return content
except TimeoutException:
    print('time out from contact')
    return None
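
One thing to watch with the loop above: it reads the page height only once, so if the lazy-loaded components make the page taller while you scroll, it may stop before the real bottom. Here is a minimal sketch of a variant that re-reads the height on every step. The helper name scroll_to_bottom and the max_steps cap are my own additions for illustration, not part of Selenium; the only Selenium call it relies on is driver.execute_script.

```python
import time


def scroll_to_bottom(driver, step=200, pause=0.5, max_steps=500):
    """Scroll down in fixed steps until the bottom of the page is reached.

    `driver` is any object exposing Selenium's `execute_script(script)` method.
    The page height is re-read on every iteration because lazy-loaded
    components can make the document taller as they appear.
    Returns the final scroll position in pixels.
    """
    position = 0
    # Hypothetical safety cap so an endlessly growing page cannot loop forever
    for _ in range(max_steps):
        height = int(driver.execute_script("return document.body.scrollHeight"))
        if position >= height:
            break
        position += step
        driver.execute_script("window.scrollTo(0, {});".format(position))
        time.sleep(pause)  # give the page time to load the newly visible content
    return position
```

With a real browser you would call scroll_to_bottom(driver) after driver.get(url) and before reading driver.page_source. Whether 0.5 seconds per step is enough depends on the site, so treat the defaults as a starting point.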