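I am trying to use Python mechanize to check exam dates/times and, if a particular date/time appears in the results, send someone an email (with a screenshot of the results page attached).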
import mechanize
from BeautifulSoup import BeautifulSoup
URL = "http://secure.dre.ca.gov/PublicASP/CurrentExams.asp"
br = mechanize.Browser()
response = br.open(URL)
# there are some errors in doctype and hence filtering the page content a bit
response.set_data(response.get_data()[200:])
br.set_response(response)
br.select_form(name="entry_form")
# select Oakland for the 1st set of checkboxes
for i in range(0, len(br.find_control(type="checkbox", name="cb_examSites").items)):
    if i == 2:
        br.find_control(type="checkbox", name="cb_examSites").items[i].selected = True
# select salesperson for the 2nd set of checkboxes
for i in range(0, len(br.find_control(type="checkbox", name="cb_examTypes").items)):
    if i == 1:
        br.find_control(type="checkbox", name="cb_examTypes").items[i].selected = True
reponse = br.submit()
print reponse.read()
I do get a response, but for some reason the data in my table is missing.
Here are the buttons in the initial HTML page:
<input type="submit" value="Get Exam List" name="B1">
<input type="button" value="Clear" name="B2" onclick="clear_entries()">
<input type="hidden" name="action" value="GO">
Part of the output (the response to the submit) where the actual data should be:
<table summary="California Exams Scheduling" class="General_list" width="100%" cellspacing="0"> <EVERYTHING IN BETWEEN IS MISSING HERE>
</table>
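All of the data in the table is missing. I have attached a screenshot of the table element taken from the Chrome browser.
Can someone tell me how to get the dates/times out of the response? I assume I have to use BeautifulSoup, so it would be something along the lines below. I am trying to find out whether a particular date I have in mind (say March 8) shows up in the response with a 1:30 pm start time. Screenshot attached.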
soup = BeautifulSoup(response.read())
print soup.find(name="table")
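Update: it looks like my problem may be related to the one here, and I am working through my options. Based on one of the answers I tried the line below, but I cannot see any tr elements in the data (even though I can see them in the page source when I check manually).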
soup.findAll('table')[0].findAll('tr')
Update: I changed this to use selenium and will try to take it further at some point soon.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import urllib3
myURL = "http://secure.dre.ca.gov/PublicASP/CurrentExams.asp"
browser = webdriver.Firefox() # Get local session of firefox
browser.get(myURL) # Load page
element = browser.find_element_by_id("Checkbox5")
element.click()
element = browser.find_element_by_id("Checkbox13")
element.click()
element = browser.find_element_by_name("B1")
element.click()
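Answer 0 (score: 0)
Five years later, maybe this can help someone. I treated your question as a training exercise and solved it with the Requests package (I used Python 3.9).
The code below has two parts.
First, the request that retrieves the data injected into the table after the POST request: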
## the request part
import requests as rq
from bs4 import BeautifulSoup as bs

url = "https://secure.dre.ca.gov/PublicASP/CurrentExams.asp"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"}
# Form fields. The checkbox values keep the single quotes from the form data.
# Requests URL-encodes the values itself, so they use literal spaces, not "+".
params = {
    "cb_examSites": [
        "'Fresno'",
        "'Los Angeles'",
        "'SF/Oakland'",
        "'Sacramento'",
        "'San Diego'",
    ],
    "cb_examTypes": [
        "'Broker'",
        "'Salesperson'",
    ],
    "B1": "Get Exam List",
    "action": "GO",
}
s = rq.Session()
s.headers.update(headers)  # send the same User-Agent on every request
s.get(url)  # the Session stores the cookie set by this first GET
r2 = s.post(url, data=params)
soup = bs(r2.content, "lxml")  # contains the data you want
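As a rough illustration of what such a form body looks like on the wire, the sketch below encodes a cut-down version of the fields with the standard library's urllib.parse.urlencode. Requests is assumed here to encode form data in a comparable way; that was not checked against the site. The `fields` dictionary and its reduced contents are made up for the example.

```python
from urllib.parse import urlencode

# Hypothetical, cut-down copy of the form fields (not the full list of sites).
fields = {
    "cb_examSites": ["'Los Angeles'", "'SF/Oakland'"],
    "action": "GO",
}

# doseq=True repeats the key once per list item, as a browser does for checkboxes.
body = urlencode(fields, doseq=True)
print(body)

# The same value written with a literal "+" encodes differently.
plus_body = urlencode({"cb_examSites": ["'Los+Angeles'"]}, doseq=True)
print(plus_body)
```

A space in a value is sent as `+`, while a literal `+` is sent as `%2B`, which is worth keeping in mind when copying values out of the browser's network panel.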
Second, parsing the response (a lot of my approach may be a bit tedious):
from bs4 import NavigableString, Tag

table = soup.find_all("table", class_="General_list")[0]
titles = [el.text for el in table.find_all("strong")]

def beetweenBr(soupx):
    # Collect the text nodes that sit between two consecutive <br> tags.
    final_str = []
    for br in soupx.find_all('br'):
        next_s = br.next_sibling
        if not (next_s and isinstance(next_s, NavigableString)):
            continue
        next2_s = next_s.next_sibling
        if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
            text = str(next_s).strip()
            if text:
                final_str.append(text)
    return "\n".join(final_str)
d = {}
trs = table.find_all("tr")
for tr in trs:
    tr_text = tr.text
    if tr_text in titles:
        # Title row: remember the current area and job for the rows that follow.
        curr_title = tr_text
        splitx = curr_title.split(" - ")
        area, job = splitx[0].split(" ")[0], splitx[1].split(" ")[0]
        if job not in d:
            d[job] = {}
        if area not in d[job]:
            d[job][area] = []
        continue
    # Data row: skip the column-header row, store everything else.
    if tr_text not in titles and tr_text != "DateBegin TimeLocationScheduledCapacity":
        tds = tr.find_all("td")
        sub = []
        for itd, td in enumerate(tds):
            if itd == 2:
                sub.append(beetweenBr(td))
            else:
                sub.append(td.text)
        d[job][area].append(sub)
`d` contains the data you want. I have not done the email part.
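For the remaining step, here is a minimal sketch of how the nested dictionary could be searched for a date and start time and turned into an email message. It assumes each stored row is a list whose first two cells are the date and the begin time (following the column-header row in the table). The helper names `find_exams` and `build_alert`, the sample rows, the dictionary keys, and the addresses are all made up for illustration; the real date and time formats on the site may differ.

```python
from email.message import EmailMessage


def find_exams(d, job, area, date_text, time_text):
    """Return rows for job/area whose date and begin-time cells contain the given text."""
    rows = d.get(job, {}).get(area, [])
    return [
        row for row in rows
        if len(row) >= 2 and date_text in row[0] and time_text in row[1]
    ]


def build_alert(matches, sender, recipient):
    """Build (but do not send) a plain-text email listing the matching rows."""
    msg = EmailMessage()
    msg["Subject"] = "Exam slot found: %d matching row(s)" % len(matches)
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(" | ".join(row) for row in matches))
    return msg


# Hypothetical sample shaped like `d`; values are invented, not scraped.
sample = {
    "Salesperson": {
        "SF/Oakland": [
            ["3/8/2016", "1:30 PM", "Oakland", "10", "50"],
            ["3/9/2016", "8:30 AM", "Oakland", "12", "50"],
        ]
    }
}

matches = find_exams(sample, "Salesperson", "SF/Oakland", "3/8", "1:30 PM")
msg = build_alert(matches, "me@example.com", "you@example.com")
print(msg)

# Sending is left out on purpose; with a reachable SMTP server it could be:
# import smtplib
# with smtplib.SMTP("localhost") as server:
#     server.send_message(msg)
```

The match is a plain substring check, so the search strings need to be adjusted to whatever format the site actually uses for dates and times.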