Starting a web scraper from a specific ASPX page

Date: 2018-11-12 20:54:56

Tags: python asp.net web-scraping python-requests

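I am trying to scrape data from this web page. It is a big job: there are about 600 next-page links, and the scraper fails after 300 pages (8 hours) with a "Connection reset by peer" error.

Rather than restarting from the beginning after every crash, I would like the scraper to start from page 300, hopefully reach the last page without errors, and then let me append the two output data files. My code (below) works when I paginate sequentially (starting from page 1), but it does not work if I open page 1 and then try to post to page 300. I get an error on the page_no variable, "AttributeError: 'NoneType' object has no attribute 'find'", which means it never reached page 300. Any ideas what the problem is and how to fix it?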

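import requests
from bs4 import BeautifulSoup

# getFormData is defined elsewhere (not shown); it presumably returns the
# __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION values of a page.

#Open Search Page
url = 'http://forestsclearance.nic.in/'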
r = requests.get(url + 'Online_Status.aspx')
VIEWSTATE, GENERATOR, VALIDATION = getFormData(r.content)
cookies = {
    'ASP.NET_SessionId': 'kaqs1jzegnfn4zxpwio4jthl',
    'countrytabs': '0',
    'countrytabs1': '0',
    'acopendivids': 'Omfc,Email,Campa,support,livestat,commitee,Links',
    'acgroupswithpersist': 'nada',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
#Click Search box
r = requests.post(
    url + 'Online_Status.aspx',
    headers=headers,
    cookies=cookies,
    data = {
        'ctl00$ScriptManager1': 'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$Button1',
        '__EVENTARGUMENT': '',
        '__EVENTTARGET': '',
        '__VIEWSTATE': VIEWSTATE,
        '__VIEWSTATEGENERATOR': GENERATOR,
        '__VIEWSTATEENCRYPTED': '',
        '__EVENTVALIDATION': VALIDATION,
        'ctl00$ContentPlaceHolder1$ddlyear': '-All Years-',
        'ctl00$ContentPlaceHolder1$ddl1': 'Select',
        'ctl00$ContentPlaceHolder1$ddl3': 'Select',
        'ctl00$ContentPlaceHolder1$ddlcategory': '-Select All-',
        'ctl00$ContentPlaceHolder1$DropDownList1': '-Select All-',
        'ctl00$ContentPlaceHolder1$txtsearch': '',
        'ctl00$ContentPlaceHolder1$HiddenField1': '',
        'ctl00$ContentPlaceHolder1$HiddenField2': '',
        '__ASYNCPOST': 'false',
        'ctl00$ContentPlaceHolder1$Button1': 'SEARCH',
    }
)
VIEWSTATE, GENERATOR, VALIDATION = getFormData(r.content)

#Post to Page 300
lastPage = 563
for page in range(300, lastPage + 1):
    r = requests.post(
        url + 'Online_Status.aspx',
        cookies=cookies,
        data = {
            'ctl00$ScriptManager1': 'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$Button1',
            'ctl00$ContentPlaceHolder1$RadioButtonList1': 'New',
            '__EVENTARGUMENT': 'Page${}'.format(page),
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$grdevents',
            '__VIEWSTATE': VIEWSTATE,
            '__VIEWSTATEGENERATOR': GENERATOR,
            '__VIEWSTATEENCRYPTED': '',
            '__EVENTVALIDATION': VALIDATION,
            'ctl00$ContentPlaceHolder1$ddlyear': '-All Years-',
            'ctl00$ContentPlaceHolder1$ddl1': 'Select',
            'ctl00$ContentPlaceHolder1$ddl3': 'Select',
            'ctl00$ContentPlaceHolder1$ddlcategory': '-Select All-',
            'ctl00$ContentPlaceHolder1$DropDownList1': '-Select All-',
            'ctl00$ContentPlaceHolder1$txtsearch': '',
            'ctl00$ContentPlaceHolder1$HiddenField1': '',
            'ctl00$ContentPlaceHolder1$HiddenField2': '',
            '__ASYNCPOST': 'false',
        }
    )

    #scrape data
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find('table', {'id' : 'ctl00_ContentPlaceHolder1_grdevents'})
    page_no = int(table.find('tr', {'class': 'pagi'}).span.text)
    rows = table.find_all('tr')
    for row in rows[1:-2]:  # skip the first row and the last two rows
        #My scraping code goes here...
        pass

    #Get form data for next page post request
    VIEWSTATE, GENERATOR, VALIDATION = getFormData(r.content)
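
For the connection resets themselves, one option is to retry the failing request and to record progress as the run goes. The sketch below is only an illustration and is not part of the script above: call_with_retries, load_checkpoint and save_checkpoint are hypothetical helper names, and the attempt count and delays are arbitrary choices.

```python
import os
import time


def call_with_retries(fetch, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    OSError covers the built-in ConnectionResetError and, because
    requests.exceptions.RequestException derives from IOError, the
    exceptions raised by requests as well.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 2s, 4s, 8s, ...


def load_checkpoint(path, default):
    """Return the page number stored in path, or default if none is stored."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return default


def save_checkpoint(path, page):
    """Record the last page whose rows were scraped and written out."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(page))
    os.replace(tmp, path)  # swap in one step so a crash leaves no partial file
```

In the loop above, the request could be wrapped as call_with_retries(lambda: requests.post(...)), with save_checkpoint called once a page's rows have been written. A checkpoint only records where the run stopped; it does not by itself make the server accept a jump to that page.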

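1 answer:

Answer 0 (score: 0)

It sounds like table (and possibly soup, and perhaps r.content) is not the object you expect (is it empty?). Have you tried dumping the response headers for that request? If you can show us the result of print(r.headers) before the soup variable is loaded, we may be able to help you dig further.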
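
A minimal sketch of that kind of check follows. The diagnose helper and the GRID_ID constant are hypothetical names introduced here; only the table id itself comes from the question. One thing worth ruling out (an assumption, not something confirmed for this site): ASP.NET event validation normally accepts only postback arguments that were rendered on the page whose __EVENTVALIDATION value is sent back, so Page$300 posted with page 1's token may be answered with an error page instead of the grid.

```python
GRID_ID = b"ctl00_ContentPlaceHolder1_grdevents"  # table id used in the question


def diagnose(status_code, headers, body):
    """Summarise a response so that a missing results grid is easy to spot."""
    return {
        "status": status_code,
        "content_type": headers.get("Content-Type", ""),
        "length": len(body),
        "has_grid": GRID_ID in body,               # False means soup.find returns None
        "has_viewstate": b"__VIEWSTATE" in body,   # False suggests this is not the form page
    }
```

Calling print(diagnose(r.status_code, r.headers, r.content)) just before the BeautifulSoup line would show whether the server returned the grid at all. If has_grid is False, saving r.content to a file and opening it in a browser should show what came back instead.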