Hidden HTML elements with Mechanize in Python

Time: 2015-03-07 16:46:29

Tags: python html web-scraping mechanize hidden

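I'm writing a Python script to check Blackboard (a school interface website) for updates. However, the HTML I receive in the script is not quite the same as the HTML I see when viewing the page in a browser. I'm not sure whether this is a cookie problem or something else I'm missing.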

import mechanize

USERNAME = ''
PASSWORD = ''

updates = 0
site = 'http://schoolsite.edu'

browser = mechanize.Browser()
browser.open(site)

# Fill in and submit the login form (the first form on the page).
browser.select_form(nr=0)
browser.form['j_username'] = USERNAME
browser.form['j_password'] = PASSWORD
browser.submit()

# The login brings back an empty form; just submit it.
browser.select_form(nr=0)
browser.submit()

html_resp = browser.response().read()

The HTML in question looks like this (this is from the script):

<span id="badgeTotal" style="visibility: hidden" title="">
<span class="hideoff" id="badgeAXLabel">Activity Updates</span>
<span class="badge" id="badgeTotalCount" title=""></span>

What I expect it to look like (from Chrome / an actual browser):

<span id="badgeTotal" style="visibility: visible;" title="">
<span class="hideoff" id="badgeAXLabel">Activity Updates</span>
<span class="badge" id="badgeTotalCount" title="">1</span>

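What I'm really after is the '1' in the last line, but I suspect the visibility attribute is blocking it. Note that I get the same cookies from Mechanize as from the browser. (The values are not identical, but the cookie names and so on are the same.)

Any ideas?

Any input is appreciated.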
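A quick way to test the visibility theory is to compare the text of the badgeTotalCount span in the two snippets above. The sketch below uses only the standard library's html.parser; the names BadgeTextParser and badge_text are hypothetical helpers introduced for illustration, not part of mechanize or Blackboard.

```python
from html.parser import HTMLParser


class BadgeTextParser(HTMLParser):
    """Collects the text inside the element whose id matches target_id.

    Simplification: assumes the target element contains no unclosed
    void tags such as <br>, which would throw off the depth counter.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def badge_text(html, target_id="badgeTotalCount"):
    """Return the stripped text of the element with the given id ('' if empty)."""
    parser = BadgeTextParser(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# The two snippets from the question, copied verbatim.
FROM_SCRIPT = '''<span id="badgeTotal" style="visibility: hidden" title="">
<span class="hideoff" id="badgeAXLabel">Activity Updates</span>
<span class="badge" id="badgeTotalCount" title=""></span>'''

FROM_BROWSER = '''<span id="badgeTotal" style="visibility: visible;" title="">
<span class="hideoff" id="badgeAXLabel">Activity Updates</span>
<span class="badge" id="badgeTotalCount" title="">1</span>'''

print(repr(badge_text(FROM_SCRIPT)))   # ''
print(repr(badge_text(FROM_BROWSER)))  # '1'
```

In the script's HTML the span is simply empty, so the style attribute is not hiding anything from the parser: the number was never in the response in the first place.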

1 Answer:

Answer 0: (score: 0)

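I'm pretty sure JavaScript is involved, which mechanize cannot handle: it does not execute scripts, so a badge count that the page fills in after loading never shows up in the response.

An alternative solution here is to automate a real browser through selenium: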

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

USERNAME = ''
PASSWORD = ''

driver = webdriver.Firefox()  # can also run headless via Firefox options
driver.get('http://schoolsite.edu')

# submit the login form
username = driver.find_element(By.NAME, 'j_username')
password = driver.find_element(By.NAME, 'j_password')

username.send_keys(USERNAME)
password.send_keys(PASSWORD)

username.submit()

# wait until the badge count is filled in by the page's JavaScript;
# the span exists even while empty, so wait on its text, not its presence
try:
    badge_text = WebDriverWait(driver, 10).until(
        lambda d: d.find_element(By.ID, 'badgeTotalCount').text.strip()
    )
    print(badge_text)
except TimeoutException:
    print('badge count did not appear within 10 seconds')
finally:
    driver.quit()
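
The badge text comes back as a string, and as the question's script output shows, it can be empty. A minimal sketch of turning it into the `updates` counter from the question follows; `parse_badge_count` is a hypothetical helper name, and treating blank or non-numeric text as zero updates is an assumption about how Blackboard renders the badge.

```python
def parse_badge_count(text):
    """Return the badge text as an int; blank or non-numeric text counts as 0."""
    text = (text or "").strip()
    # Assumption: an empty badge means "no updates" rather than an error.
    return int(text) if text.isdecimal() else 0


updates = parse_badge_count("1")
print(updates)  # 1
```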