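Scraping an .aspx form with Python

Time: 2014-12-01 14:45:42

Tags: python asp.net

I am new to Python and am trying to run some searches through an .aspx form. When I execute the code below, I get an error. I am using Python 3.4.2.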

    import urllib
    from bs4 import BeautifulSoup
    import urllib.request
    from urllib.request import urlopen

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Origin': 'http://www.indiapost.gov.in',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Referer': 'http://www.indiapost.gov.in/pin/',
        'Accept-Encoding': 'gzip,deflate,sdch',
        'Accept-Language': 'en-US,en;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
    }

    class MyOpener(urllib.request.FancyURLopener):
        version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'


    myopener = MyOpener()
    url = 'http://legistar.council.nyc.gov/Legislation.aspx'
    # first HTTP request without form data
    f = myopener.open(url)
    soup = BeautifulSoup(f)

    #vstate = soup.select("#__VSTATE")[0]['value']
    viewstate = soup.select("#__VIEWSTATE")[0]['value']
    eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']

    formFields = (
        (r'__VSTATE', r''),
        (r'__VIEWSTATE', viewstate),
        (r'__EVENTVALIDATION', eventvalidation),
        (r'ctl00_RadScriptManager1_HiddenField', ''),
        (r'ctl00_tabTop_ClientState', ''),
        (r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
        (r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),
        (r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'),  # file number
        (r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'),  # legislative text
        (r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'),  # attachment
        (r'ctl00$ContentPlaceHolder1$txtSearch', 'york'),   # search text
        (r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'),  # years to include
        (r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'),  # types to include
        (r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation')  # search button itself
    )

    encodedFields = urllib.parse.urlencode(formFields)
    # second HTTP request with form data
    f = myopener.open(url, encodedFields)

    try:
        # actually we'd better use BeautifulSoup once again to
        # retrieve results (instead of writing out the whole HTML file)
        # Besides, since the result is split into multiple pages,
        # we need to send more HTTP requests
        fout = open('tmp.html', 'wb')
    except:
        print('Could not open output file\n')
    fout.writelines(f.readlines())
    fout.close()

This script does not return any results.

How can I make the script submit the search form and return the results?

1 answer:

Answer 0 (score: 0)

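As Andrei mentioned in the comments, you will need to import urllib, but you are likely to run into other problems with the code because you are hard-coding __VIEWSTATE and __EVENTVALIDATION.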
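To illustrate the point about not hard-coding these values, here is a minimal sketch of reading every hidden input from a fetched page so that the tokens are always the ones the server just issued. It uses the standard library's html.parser in place of BeautifulSoup only so that it has no third-party dependency. The names HiddenFieldParser and extract_hidden_fields, and the sample HTML with its token values, are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of hidden form fields found in an HTML string."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical page fragment; real ASP.NET tokens are long opaque strings.
sample_html = """
<form method="post" action="Legislation.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="ctl00$ContentPlaceHolder1$txtSearch" value="" />
</form>
"""

hidden = extract_hidden_fields(sample_html)
print(hidden)
```

Collecting all hidden inputs, rather than naming two of them, also picks up any extra state fields the page may carry, which is one plausible reason a replayed form gets rejected.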

Hui Zheng gave a very good explanation that helped me figure this out, so I will just link to his answer rather than try to explain it myself.
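For the second request in the question's code (the POST that carries the form data), a rough Python 3 sketch might look like the following. It rests on two facts about the standard library: urllib.parse must be importable for urlencode, and urllib.request expects POST data as bytes rather than str. The helper name build_search_request, the default User-Agent string, and the placeholder token values are assumptions for illustration; the field names are copied from the question. The request is only built here, not sent.

```python
import urllib.parse
import urllib.request


def build_search_request(url, hidden_fields, search_text,
                         user_agent="Mozilla/5.0"):
    """Build (but do not send) a POST request for the search form.

    hidden_fields should come from a fresh GET of the same page,
    for example the __VIEWSTATE and __EVENTVALIDATION values.
    """
    fields = dict(hidden_fields)
    # Field names copied from the question's formFields tuple.
    fields["ctl00$ContentPlaceHolder1$txtSearch"] = search_text
    fields["ctl00$ContentPlaceHolder1$btnSearch"] = "Search Legislation"

    # urlencode returns str; Python 3 requires the request body as bytes.
    body = urllib.parse.urlencode(fields).encode("ascii")

    return urllib.request.Request(
        url,
        data=body,  # supplying data makes this a POST
        headers={
            "User-Agent": user_agent,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


request = build_search_request(
    "http://legistar.council.nyc.gov/Legislation.aspx",
    {"__VIEWSTATE": "abc123", "__EVENTVALIDATION": "xyz789"},  # placeholders
    "york",
)
print(request.get_method())
print(request.data)

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as response:
#     html = response.read()
```

Using Request with urlopen instead of FancyURLopener is a swapped-in choice: FancyURLopener has been deprecated since Python 3.3. Whether the server accepts the request still depends on the remaining form fields and on cookies, which this sketch does not handle.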