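I am trying to scrape data from this webpage. It is a big job: there are about 600 "next page" links, and the scraper fails after 300 pages (8 hours) with a "Connection reset by peer" error.
Rather than starting over after every crash, I would like the scraper to start from page 300 and hopefully reach the last page without errors, so that I can then append the two output data files. My code (below) works when I paginate sequentially (starting from page 1), but it does not work if I open page 1 and then try to post straight to page 300. I get an error on the page_no variable, "AttributeError: 'NoneType' object has no attribute 'find'", which means it never reached page 300. Any ideas what the problem is and how to fix it?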
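import requests
from bs4 import BeautifulSoup

# Open the search page.
# getFormData is a helper (not shown) that presumably extracts the hidden
# __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION form values.
url = 'http://forestsclearance.nic.in/'
r = requests.get(url + 'Online_Status.aspx')
VIEWSTATE, GENERATOR, VALIDATION = getFormData(r.content)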
cookies = {
    'ASP.NET_SessionId': 'kaqs1jzegnfn4zxpwio4jthl',
    'countrytabs': '0',
    'countrytabs1': '0',
    'acopendivids': 'Omfc,Email,Campa,support,livestat,commitee,Links',
    'acgroupswithpersist': 'nada',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
# Click the Search button
r = requests.post(
    url + 'Online_Status.aspx',
    headers=headers,
    cookies=cookies,
    data={
        'ctl00$ScriptManager1': 'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$Button1',
        '__EVENTARGUMENT': '',
        '__EVENTTARGET': '',
        '__VIEWSTATE': VIEWSTATE,
        '__VIEWSTATEGENERATOR': GENERATOR,
        '__VIEWSTATEENCRYPTED': '',
        '__EVENTVALIDATION': VALIDATION,
        'ctl00$ContentPlaceHolder1$ddlyear': '-All Years-',
        'ctl00$ContentPlaceHolder1$ddl1': 'Select',
        'ctl00$ContentPlaceHolder1$ddl3': 'Select',
        'ctl00$ContentPlaceHolder1$ddlcategory': '-Select All-',
        'ctl00$ContentPlaceHolder1$DropDownList1': '-Select All-',
        'ctl00$ContentPlaceHolder1$txtsearch': '',
        'ctl00$ContentPlaceHolder1$HiddenField1': '',
        'ctl00$ContentPlaceHolder1$HiddenField2': '',
        '__ASYNCPOST': 'false',
        'ctl00$ContentPlaceHolder1$Button1': 'SEARCH',
    }
)
VIEWSTATE, GENERATOR, VALIDATION = getFormData(r.content)
# Post to page 300 and continue to the last page
lastPage = 563
for page in range(300, lastPage + 1):
    r = requests.post(
        url + 'Online_Status.aspx',
        cookies=cookies,
        data={
            'ctl00$ScriptManager1': 'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$Button1',
            'ctl00$ContentPlaceHolder1$RadioButtonList1': 'New',
            '__EVENTARGUMENT': 'Page${}'.format(page),
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$grdevents',
            '__VIEWSTATE': VIEWSTATE,
            '__VIEWSTATEGENERATOR': GENERATOR,
            '__VIEWSTATEENCRYPTED': '',
            '__EVENTVALIDATION': VALIDATION,
            'ctl00$ContentPlaceHolder1$ddlyear': '-All Years-',
            'ctl00$ContentPlaceHolder1$ddl1': 'Select',
            'ctl00$ContentPlaceHolder1$ddl3': 'Select',
            'ctl00$ContentPlaceHolder1$ddlcategory': '-Select All-',
            'ctl00$ContentPlaceHolder1$DropDownList1': '-Select All-',
            'ctl00$ContentPlaceHolder1$txtsearch': '',
            'ctl00$ContentPlaceHolder1$HiddenField1': '',
            'ctl00$ContentPlaceHolder1$HiddenField2': '',
            '__ASYNCPOST': 'false',
        }
    )
    # Scrape data
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find('table', {'id': 'ctl00_ContentPlaceHolder1_grdevents'})
    # The AttributeError is raised here: table is None
    page_no = int(table.find('tr', {'class': 'pagi'}).span.text)
    rows = table.findAll('tr')
    for row in rows[1:len(rows)-2]:
        # My scraping code goes here...
        pass
    # Get form data for the next page's post request
    VIEWSTATE, GENERATOR, VALIDATION = getFormData(r.content)
Answer 0 (score: 0)
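It sounds like table (or possibly soup, or even r.content) is not the object you expect (is it empty?). Have you tried dumping the response headers for that request? If you can show us the output of print(r.headers) before the soup variable is loaded, we may be able to help you dig further.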
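As a rough sketch of that check, something like the following could be called on each response before it is parsed. The names summarize_response and GRID_ID are made up for illustration; only the status_code, headers and content attributes of the requests response are relied on, and the table id is copied from the question's code.

```python
# Hypothetical helper: gather what is worth inspecting about a response
# before handing r.content to BeautifulSoup.

# Table id copied from the question's soup.find(...) call.
GRID_ID = b"ctl00_ContentPlaceHolder1_grdevents"


def summarize_response(r):
    """Return a dict describing a requests-style response object."""
    body = r.content or b""
    return {
        "status_code": r.status_code,
        "headers": dict(r.headers),
        "body_length": len(body),
        # False means soup.find('table', {'id': ...}) will return None.
        "has_grid": GRID_ID in body,
        # The first bytes often reveal an error page or a redirect stub.
        "body_start": body[:200],
    }
```

Calling print(summarize_response(r)) just before the BeautifulSoup line in the loop shows whether the server returned the results grid at all. If has_grid is False, the headers and the start of the body are the first things to look at.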