Python: problem getting results from an .aspx search form

Time: 2014-12-02 17:49:34

Tags: python python-3.x web-scraping beautifulsoup web-crawler

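I am new to Python. I am trying to build a bot that can run a search on a website that uses an .aspx search form: it should submit the form and then save the results to a file.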

Here is my script:

 import urllib.parse
 import urllib.request
 from bs4 import BeautifulSoup


 headers = {
     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
     'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
     'Content-Type': 'application/x-www-form-urlencoded',
     'Accept-Encoding': 'gzip,deflate,sdch',
     'Accept-Language': 'en-US,en;q=0.8',
     'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
 }

 class MyOpener(urllib.request.FancyURLopener):
     version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

 myopener = MyOpener()

 url = 'http://legistar.council.nyc.gov/Legislation.aspx'
 # first HTTP request without form data
 f = myopener.open(url)
 soup = BeautifulSoup(f)

 lastfocus = soup.select("#__LASTFOCUS")[0]['value']
 eventtarget = soup.select("#__EVENTTARGET")[0]['value']
 eventargument = soup.select("#__EVENTARGUMENT")[0]['value']
 viewstate = soup.select("#__VIEWSTATE")[0]['value']

 formFields = (
    (r'__LASTFOCUS', lastfocus),
    (r'__EVENTTARGET', eventtarget),
    (r'__EVENTARGUMENT', eventargument),
    (r'__VIEWSTATE', viewstate),
    (r'ctl00_RadScriptManager1_TSM', ''),
    (r'ctl00_tabTop_ClientState', ''),
    (r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
    (r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),
                                                   # Check boxes
    (r'ctl00$ContentPlaceHolder1$chkID', 'on'),  # file number
    (r'ctl00$ContentPlaceHolder1$chkText', 'on'),  # Legislative text
    (r'ctl00$ContentPlaceHolder1$chkAttachments', 'on'),  # attachments
                                                   # etc. (not all listed)
    (r'ctl00$ContentPlaceHolder1$txtSearch', 'york'),   # Search text
    (r'ctl00$ContentPlaceHolder1$lstYears', '2014'),  # Years to include
    (r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'),  # types to include
    (r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation')  # Search button itself
 )

 encodedFields = urllib.parse.urlencode(formFields)
 # second HTTP request with form data
 f = myopener.open(url, encodedFields)

 # actually we'd better use BeautifulSoup once again to
 # retrieve the results (instead of writing out the whole HTML file).
 # Besides, since the result is split across multiple pages,
 # we need to send more HTTP requests
 try:
     # the with-block closes the file even if writing fails
     with open('tmp.html', 'wb') as fout:
         fout.write(f.read())
 except OSError:
     print('Could not open output file\n')

The script runs without any errors, but when I open tmp.html, I do not see the results that the actual website displays.

This is what I get instead:

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml">
 <head><title>
    Error
 </title></head>
 <body>
<form name="form1" method="post" action="Error.aspx" id="form1">
 <div>
 <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="ND1u0lOZH65sNTWWoa6wLYsEtU6yeI938ytDgbd2dC167Gk8a/1RonXoednpTu74caJ8DocoE4ewDkNe6u02VlFhiTlr5MevcRRE7CVvClRleCWGYiPME3cqJWvjA8uv" />
 </div>

 <div>

    <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="AB827D4F" />
 </div>
     <div>
         <h2>
        Server Error</h2>
         <h4>
             The server encountered a temporary error and could not complete your request.</h4>
         <h4>
             Please <a href="Default.aspx">try again</a> in 30 seconds.</h4>
     </div>
     </form>
 </body>
 </html>
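
The saved file is the site's generic error page, not a result list. One possible cause, which nothing in this thread confirms, is that the POST omits hidden fields the server expects: the error page above carries a `__VIEWSTATEGENERATOR` input that the script never sends, and ASP.NET pages often include others such as `__EVENTVALIDATION`. Browsers submit inputs by their `name` attribute, so a more robust approach is to collect every hidden input by name from the first response instead of hand-picking four. The sketch below does that with the standard library only. `HiddenInputCollector`, `build_post_body`, and the sample HTML are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_post_body(page_html, visible_fields):
    """Merge the page's hidden inputs with the fields a user would fill in."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    fields = dict(collector.fields)
    fields.update(visible_fields)
    # urllib.request.urlopen() expects POST data as bytes in Python 3.
    return urlencode(fields).encode("ascii")


# Stand-in HTML (an assumption); a real run would pass the HTML of the first GET.
sample_page = """
<form method="post" action="Legislation.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="AB827D4F" />
  <input type="text" name="ctl00$ContentPlaceHolder1$txtSearch" value="" />
</form>
"""

body = build_post_body(sample_page, {"ctl00$ContentPlaceHolder1$txtSearch": "york"})
print(body.decode("ascii"))
```

Whether this alone is enough for this particular site is untested; the server may also depend on cookies or on state set by JavaScript.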

How can I make the script return the results I am looking for?

Any help is greatly appreciated.

1 Answer:

Answer 0 (score: 1):

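This code works perfectly. It uses Selenium to drive a real browser, so the browser itself submits the form, including the hidden ASP.NET fields.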

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://legistar.council.nyc.gov/Legislation.aspx")
# Alternatively, link directly to the form:
# driver.get("https://www.icsi.in/student/Members/MemberSearch.aspx?SkinSrc=%5BG%5DSkins/IcsiTheme/IcsiIn-Bare&ContainerSrc=%5BG%5DContainers/IcsiTheme/NoContainer")

# Locate the elements. find_element(By.ID, ...) replaces find_element_by_id,
# which is no longer available in current Selenium 4 releases.
first = driver.find_element(By.ID, "ctl00_ContentPlaceHolder1_txtSearch")
search = driver.find_element(By.ID, "ctl00_ContentPlaceHolder1_btnSearch")

# Input the data and click submit.
first.send_keys("York")
search.click()
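
The question also asks for the results to be saved to a file, which the snippet above does not do. The following is a minimal sketch of one way to add that, not a tested solution for this site. The helper names (`page_title`, `save_results`, `run_search`) are hypothetical. The results-grid id `ctl00_ContentPlaceHolder1_gridMain` is an assumption inferred from the `gridMain_ClientState` field in the question, and the check for a page titled "Error" is based only on the error output shown there.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Extract the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def save_results(html, path="tmp.html"):
    """Write the page to disk unless it looks like the site's error page."""
    if page_title(html) == "Error":
        return False
    with open(path, "w", encoding="utf-8") as fout:
        fout.write(html)
    return True


def run_search(term, path="tmp.html"):
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get("http://legistar.council.nyc.gov/Legislation.aspx")
        driver.find_element(By.ID, "ctl00_ContentPlaceHolder1_txtSearch").send_keys(term)
        driver.find_element(By.ID, "ctl00_ContentPlaceHolder1_btnSearch").click()
        # Assumed id of the results grid; adjust after inspecting the live page.
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.ID, "ctl00_ContentPlaceHolder1_gridMain"))
        )
        return save_results(driver.page_source, path)
    finally:
        driver.quit()
```

Calling `run_search("York")` would open Firefox, run the search, and write the rendered page to `tmp.html`. If the grid element already exists before the search is submitted, the wait condition passes too early and a different condition would be needed. Results spread over several pages would still need extra clicks on the pager.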