Using Python to post a request to an ASP.NET page

Time: 2013-02-07 08:32:08

Tags: python http-post web-scraping

I want to scrape PIN codes from "http://www.indiapost.gov.in/pin/", and I have written the following code.

import urllib
import urllib2
headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://www.indiapost.gov.in',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://www.indiapost.gov.in/pin/',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}
viewstate = 'JulXDv576ZUXoVOwThQQj4bDuseXWDCZMP0tt+HYkdHOVPbx++G8yMISvTybsnQlNN76EX/...'
eventvalidation = '8xJw9GG8LMh6A/b6/jOWr970cQCHEj95/6ezvXAqkQ/C1At06MdFIy7+iyzh7813e1/3Elx...'
url = 'http://www.indiapost.gov.in/pin/'
formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__EVENTTARGET',''),
    ('__EVENTARGUMENT',''),
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEENCRYPTED',''),
    ('__EVENTVALIDATION', eventvalidation),
    ('txt_offname',''),
    ('ddl_dist','0'),
    ('txt_dist_on',''),
    ('ddl_state','2'),
    ('btn_state','Search'),
    ('txt_stateon',''),
    ('hdn_tabchoice','3')
)


from urllib import FancyURLopener
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()


encodedFields = urllib.urlencode(formData)

f = myopener.open(url, encodedFields)
print f.info()

try:
    fout = open('tmp.txt', 'w')
except:
    print('Could not open output file\n')

fout.writelines(f.readlines())
fout.close()

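The server responds with "Sorry this site has encountered a serious problem, please try reloading the page or contact the webmaster." Please suggest where I am going wrong.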

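1 Answer:

Answer 0: (score: 19)

Where did you get the values of viewstate and eventvalidation? For one thing, they should not end with "..."; you must have omitted something. For another, they should not be hard-coded.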
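To illustrate the first point: ASP.NET emits __VIEWSTATE and __EVENTVALIDATION as base64 text, so a value copied with a trailing "..." is not even well-formed. Below is a minimal Python 3 sketch of such a sanity check. The helper name `looks_like_complete_base64` and both sample strings are made up for illustration; the check only says whether a string is valid base64, not whether the server would accept it.

```python
import base64
import binascii


def looks_like_complete_base64(value):
    """Return True if value decodes as strict base64, False otherwise."""
    try:
        base64.b64decode(value, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


# Illustrative stand-ins, not real values from the site.
truncated = "JulXDv576ZUXoVOw..."  # copied with an ellipsis, as in the question
well_formed = base64.b64encode(b"example state").decode("ascii")

print(looks_like_complete_base64(truncated))
print(looks_like_complete_base64(well_formed))
```

The "." character is not part of the base64 alphabet, so the truncated string fails the strict decode.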

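One solution is like this:

  1. Retrieve the web page at "http://www.indiapost.gov.in/pin/" without any form data.
  2. Parse the page and retrieve the form values, such as __VIEWSTATE and __EVENTVALIDATION (you can use BeautifulSoup).
  3. Get the search results (the second HTTP request) by adding the vital form data from step 2.

Update

Based on the idea above, I modified your code slightly to make it work: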

    import urllib
    from bs4 import BeautifulSoup
    
    headers = {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Origin': 'http://www.indiapost.gov.in',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Referer': 'http://www.indiapost.gov.in/pin/',
        'Accept-Encoding': 'gzip,deflate,sdch',
        'Accept-Language': 'en-US,en;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
    }
    
    class MyOpener(urllib.FancyURLopener):
        version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
    
    myopener = MyOpener()
    url = 'http://www.indiapost.gov.in/pin/'
    # first HTTP request without form data
    f = myopener.open(url)
    soup = BeautifulSoup(f)
    # parse and retrieve two vital form values
    viewstate = soup.select("#__VIEWSTATE")[0]['value']
    eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
    
    formData = (
        ('__EVENTVALIDATION', eventvalidation),
        ('__VIEWSTATE', viewstate),
        ('__VIEWSTATEENCRYPTED',''),
        ('txt_offname', ''),
        ('ddl_dist', '0'),
        ('txt_dist_on', ''),
        ('ddl_state','1'),
        ('btn_state', 'Search'),
        ('txt_stateon', ''),
        ('hdn_tabchoice', '1'),
        ('search_on', 'Search'),
    )
    
    encodedFields = urllib.urlencode(formData)
    # second HTTP request with form data
    f = myopener.open(url, encodedFields)
    
    # It would be better to use BeautifulSoup once again to retrieve
    # the results (instead of writing out the whole HTML file).
    # Besides, since the result is split into multiple pages,
    # we need to send more HTTP requests.
    try:
        fout = open('tmp.html', 'w')
    except IOError:
        print('Could not open output file\n')
    else:
        # only write when the file was opened successfully
        fout.writelines(f.readlines())
        fout.close()
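
If you would rather not pick out the hidden fields one by one, step 2 can also be done generically. The following is a rough Python 3 sketch using only the standard library (so not the Python 2 / BeautifulSoup combination used above). `HiddenFieldParser`, `build_form_body`, and `SAMPLE_PAGE` are hypothetical names and data invented for this example; the sample HTML only imitates the shape of an ASP.NET form and is not the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, parse_qs


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_form_body(page_html, visible_fields):
    """Merge the page's hidden fields with the fields we fill in ourselves."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    form = dict(parser.fields)
    form.update(visible_fields)
    # urlencode escapes '+', '/' and '=' so the base64 text survives the POST
    return urlencode(form)


# Made-up page fragment; the hidden values are placeholders.
SAMPLE_PAGE = """
<form method="post" action="./">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc+def/ghi=" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz/123+45==" />
  <input type="text" name="txt_offname" />
</form>
"""

body = build_form_body(SAMPLE_PAGE, {"ddl_state": "1", "btn_state": "Search"})
print(body)
```

The resulting string can be passed as the POST data of the second request. Copying every hidden field this way means the script keeps working if the page adds another hidden input later, though that is an assumption about how the site behaves rather than something verified here.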