Checking exam dates/times on a website with Python mechanize

Date: 2016-03-06 13:12:58

Tags: python web-scraping mechanize

I'm trying to use Python mechanize to check the dates/times of an exam on a website, and to email someone if a particular date/time shows up in the results (a screenshot of the results page was attached to the original post).

import mechanize
from BeautifulSoup import BeautifulSoup
URL = "http://secure.dre.ca.gov/PublicASP/CurrentExams.asp"


br = mechanize.Browser()
response = br.open(URL)


# there are some errors in doctype and hence filtering the page content a bit
response.set_data(response.get_data()[200:])

br.set_response(response)
br.select_form(name="entry_form")

# select Oakland in the 1st set of checkboxes
sites = br.find_control(type="checkbox", name="cb_examSites")
sites.items[2].selected = True

# select Salesperson in the 2nd set of checkboxes
types = br.find_control(type="checkbox", name="cb_examTypes")
types.items[1].selected = True

response = br.submit()
print response.read()

I'm able to get a response, but for some reason the data in the table is missing.

These are the buttons in the initial HTML page:

<input type="submit" value="Get Exam List" name="B1">
<input type="button" value="Clear" name="B2" onclick="clear_entries()">
<input type="hidden" name="action" value="GO">

This is the part of the output (the submit response) where the actual data should be:

<table summary="California Exams Scheduling" class="General_list" width="100%" cellspacing="0"> <EVERYTHING IN BETWEEN IS MISSING HERE>
</table>

All of the data inside the table is missing. A screenshot of the table element from the Chrome dev tools was attached to the original post.

  1. Can someone tell me what might be going wrong?
  2. Can someone tell me how to get the dates/times out of the response (I assume I'll have to use BeautifulSoup), so I'll have to do something along these lines. I'm trying to find out whether a particular date I have in mind (say March 8th) shows up in the response with a 1:30 PM start time. A screenshot was attached.

    soup = BeautifulSoup(response.read())
    print soup.find(name="table")

  3. Update: it looks like my issue may be related to this question, and I'm trying out my options. Based on one of the answers there I tried the line below, but I can't see any tr elements in the data (although I can see them in the page source when I inspect it manually):

    soup.findAll('table')[0].findAll('tr') 
    


    Update: modified this to use selenium; will pick it up again at some point soon.

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.keys import Keys
    from bs4 import BeautifulSoup
    
    
    myURL = "http://secure.dre.ca.gov/PublicASP/CurrentExams.asp"
    browser = webdriver.Firefox() # Get local session of firefox
    browser.get(myURL) # Load page
    
    element = browser.find_element_by_id("Checkbox5")
    element.click()
    
    
    element = browser.find_element_by_id("Checkbox13")
    element.click()
    
    element = browser.find_element_by_name("B1")
    element.click()
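Once the rendered page is in hand (e.g. from selenium's `browser.page_source` after clicking the submit button), checking whether a particular date and start time appears in the results table can be sketched as below. The function name and the sample table fragment are hypothetical, not taken from the real site:

```python
from bs4 import BeautifulSoup

def exam_slot_offered(html, date_str, time_str):
    """Return True if any table row mentions both the date and the start time."""
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find_all("tr"):
        text = tr.get_text(" ", strip=True)  # flatten the row's cells into one string
        if date_str in text and time_str in text:
            return True
    return False

# hypothetical fragment mimicking the site's results table
sample = """
<table class="General_list">
  <tr><td>3/8/2016</td><td>1:30 PM</td><td>Oakland</td></tr>
  <tr><td>3/9/2016</td><td>8:30 AM</td><td>Oakland</td></tr>
</table>
"""
print(exam_slot_offered(sample, "3/8/2016", "1:30 PM"))  # True
```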
    

1 Answer:

Answer 0 (score: 0)

Five years later, maybe this can help someone. I took your question as a training exercise and completed it with the Requests package (using Python 3.9).

The code below is in two parts:

  • The request that retrieves the data injected into the table after a POST request.

    ## the request part

    import requests as rq
    from bs4 import BeautifulSoup as bs, NavigableString, Tag

    url = "https://secure.dre.ca.gov/PublicASP/CurrentExams.asp"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"}
    
    params = {
        "cb_examSites": [
            "'Fresno'",
            "'Los+Angeles'",
            "'SF/Oakland'",
            "'Sacramento'",
            "'San+Diego'"
        ],
        "cb_examTypes": [
            "'Broker'",
            "'Salesperson'"
        ],
        "B1": "Get+Exam+List",
        "action": "GO"
    }
    
    s = rq.Session()
    r = s.get(url, headers=headers)
    s.headers.update({"Cookie": "%s=%s" % (r.cookies.keys()[0], r.cookies.values()[0])})
    r2 = s.post(url=url, data=params)
    soup = bs(r2.content, "lxml")  # contains the data you want
    
  • Parsing the response (many of my methods may be a bit clunky):

    table = soup.find_all("table", class_="General_list")[0]

    titles = [el.text for el in table.find_all("strong")]

    def beetweenBr(soupx):
        final_str = []
        for br in soupx.findAll('br'):
            next_s = br.nextSibling
            if not (next_s and isinstance(next_s, NavigableString)):
                continue
            next2_s = next_s.nextSibling
            if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
                text = str(next_s).strip()
                if text:
                    final_str.append(next_s.strip())
        return "\n".join(final_str)

    d = {}
    trs = table.find_all("tr")
    for tr in trs:
        tr_text = tr.text
        if tr_text in titles:
            curr_title = tr_text
            splitx = curr_title.split(" - ")
            area, job = splitx[0].split(" ")[0], splitx[1].split(" ")[0]
            if not job in d.keys():
                d[job] = {}
            if not area in d[job].keys():
                d[job][area] = []
            continue
        if (not tr_text in titles) & (tr_text != "DateBegin TimeLocationScheduledCapacity"):
            tds = tr.find_all("td")
            sub = []
            for itd, td in enumerate(tds):
                if itd == 2:
                    sub.append(beetweenBr(td))
                else:
                    sub.append(td.text)
            d[job][area].append(sub)
    

The dictionary "d" contains the data you want. I haven't done the email-sending part.
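For the email part the original question asked about, a minimal sketch with the standard library's `smtplib` and `email.message` could look like the following. The sender address, SMTP server, and credentials are placeholders, and `build_alert` is a hypothetical helper, not part of the code above:

```python
import smtplib
from email.message import EmailMessage

def build_alert(slot_rows, to_addr):
    """Compose a notification email listing the matching exam slots."""
    msg = EmailMessage()
    msg["Subject"] = "Exam slot available"
    msg["From"] = "alerts@example.com"  # placeholder sender
    msg["To"] = to_addr
    msg.set_content("\n".join(", ".join(row) for row in slot_rows))
    return msg

# Sending would look like this (server and credentials are placeholders):
# with smtplib.SMTP("smtp.example.com", 587) as server:
#     server.starttls()
#     server.login("user", "app-password")
#     server.send_message(build_alert([["3/8/2016", "1:30 PM", "SF/Oakland"]], "me@example.com"))
```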