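I want to scrape the content of Hong Kong legislation. However, I can't access the content that isn't yet visible unless I scroll down the page.
The site I'm visiting: https://www.elegislation.gov.hk/hk/cap211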
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import ElementNotVisibleException
from selenium.webdriver.common.action_chains import ActionChains

def init_driver(profile):
    # Start Firefox with the given profile and attach a 5-second explicit wait
    driver = webdriver.Firefox(profile)
    driver.wait = WebDriverWait(driver, 5)
    return driver

def convert2text2(webElement):
    # Return the UTF-8 encoded text of each element, or ['NA'] if nothing matched
    if webElement != []:
        webElements = []
        for element in webElement:
            e = element.text.encode('utf8')
            webElements.append(e)
    else:
        webElements = ['NA']
    return webElements

profile = webdriver.FirefoxProfile()
driver = init_driver(profile)
url = 'https://www.elegislation.gov.hk/hk/cap211'
driver.get(url)
driver.wait = WebDriverWait(driver, 5)
# Collect every content block that is currently loaded in the page
content = driver.find_elements(By.XPATH, "//div[@class='hklm_content' or @class='hklm_leadIn' or @class='hklm_continued']")
content = convert2text2(content)
I understand that the following code, taken from "How can I scroll a web page using selenium webdriver in python?", can be used to scroll the browser:
import time

SCROLL_PAUSE_TIME = 0.5
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
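But I can't figure out how to target the scrollbar of the content pane and scroll to its bottom.
Answer 0 (score: 1)
You just need to put last_height into the JavaScript code, like this: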
# last_height and SCROLL_PAUSE_TIME are defined as in the question's snippet
while True:
    # Scroll down to 'last_height'
    driver.execute_script("window.scrollTo(0, {});".format(last_height))
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight;")
    if new_height == last_height:
        break
    last_height = new_height
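If the legislation text scrolls inside its own container rather than in the window, as the question suggests, the same loop can be pointed at that element instead of document.body. The following is only a sketch under that assumption: scroll_element_to_bottom is a hypothetical helper name, the max_rounds safety cap is an addition, and which element actually owns the scrollbar has to be found in the browser inspector. It relies on execute_script passing extra arguments to the script as arguments[0], arguments[1], and so on.

```python
import time

def scroll_element_to_bottom(driver, element, pause=0.5, max_rounds=50):
    """Scroll one scrollable element (not the window) until its height stops growing.

    driver  -- a Selenium WebDriver instance
    element -- the WebElement that owns the scrollbar (an assumption: identify it yourself)
    Returns the final scrollHeight of the element.
    """
    last_height = driver.execute_script("return arguments[0].scrollHeight;", element)
    for _ in range(max_rounds):  # safety cap so the loop cannot run forever
        # Move the element's own scrollbar to its current bottom
        driver.execute_script(
            "arguments[0].scrollTop = arguments[0].scrollHeight;", element)
        # Give the page time to load the next chunk
        time.sleep(pause)
        new_height = driver.execute_script("return arguments[0].scrollHeight;", element)
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return last_height
```

It would be called with the driver from the question and whichever element turns out to be the scroll container, for example one located with driver.find_element and a selector of your own; no selector is given here because the right one is not known from the thread.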
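Another approach is to pull the data directly, without Selenium. If you look at the requests the page makes (Chrome inspector, Network tab), you'll see that each new element is loaded into the site as a small chunk of XML.
The URL for the starting point is 'https://www.elegislation.gov.hk/xml?skipHSC=true&LANGUAGE=E&BILINGUAL=&LEG_PROV_MASTER_ID=181740&QUERY=.&INDEX_CS=N&PUBLISHED=true'
For each chunk the site loads, the LEG_PROV_MASTER_ID parameter increases by 1.
You can fetch the chunks with requests like this: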
import requests

url = 'https://www.elegislation.gov.hk/xml?skipHSC=true&LANGUAGE=E&BILINGUAL=&LEG_PROV_MASTER_ID={}&QUERY=.&INDEX_CS=N&PUBLISHED=true'
starting_count = 181740
# Must be an integer: you need to work out when you have everything you need.
# The placeholder below fetches only the first chunk.
stop_count = starting_count
count = starting_count
while count <= stop_count:
    response = requests.get(url.format(count))
    # parse the xml and grab the parts you need...
    count += 1
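The "parse the xml" step could look roughly like the sketch below, which uses the standard library's xml.etree.ElementTree. The helper names (chunk_url, extract_text, collect_text), the max_chunks limit, and the sample XML are all made up for illustration; the real tag names in the site's responses are not shown in this thread, so the sketch simply joins all text nodes instead of picking specific tags. Stopping at the first empty result is also an assumption about how the server behaves past the last chunk.

```python
import xml.etree.ElementTree as ET

URL_TEMPLATE = ('https://www.elegislation.gov.hk/xml?skipHSC=true&LANGUAGE=E&BILINGUAL='
                '&LEG_PROV_MASTER_ID={}&QUERY=.&INDEX_CS=N&PUBLISHED=true')

def chunk_url(master_id):
    """Build the request URL for one chunk."""
    return URL_TEMPLATE.format(master_id)

def extract_text(xml_payload):
    """Join every non-empty text node of an XML payload with single spaces.

    Pass bytes (e.g. response.content) when the payload carries an
    encoding declaration; ElementTree rejects such payloads as str.
    """
    root = ET.fromstring(xml_payload)
    pieces = (piece.strip() for piece in root.itertext())
    return ' '.join(piece for piece in pieces if piece)

def collect_text(fetch, start_id, max_chunks):
    """Fetch consecutive chunks and return their text.

    fetch -- callable taking a URL and returning the payload, or None/empty
             when there is nothing to read (assumed stop signal)
    """
    texts = []
    for master_id in range(start_id, start_id + max_chunks):
        payload = fetch(chunk_url(master_id))
        if not payload:
            break  # treat the first missing chunk as the end
        texts.append(extract_text(payload))
    return texts

# Made-up sample, only to show what extract_text does
sample = ('<section><num>1.</num><heading>Short title</heading>'
          '<content>Sample text.</content></section>')
print(extract_text(sample))  # prints: 1. Short title Sample text.
```

With requests, fetch could be a small wrapper that returns response.content for a successful response and None otherwise. Keeping the network call outside these helpers makes them easy to try on saved responses before hitting the site.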