Unable to scrape data from an ASPX page using Python requests

Asked: 2020-11-11 12:06:08

Tags: python web-scraping python-requests

I am trying to scrape data from a multi-page table that is returned after filling in a form. The form itself lives at https://ndber.seai.ie/Pass/assessors/search.aspx

In Firefox's developer tools I watched the initial POST request go out and copied all the parameters listed in its Request section.

However, all I get back is a "404 - File or directory not found" response.

I suspect the site uses a cookie or something similar to prevent scraping. Now, this is a government site and the data is public, so if that is the case I find it a bit rich.

Does anyone know how to handle a case like this?

Thanks / Colm

P.S. The 500+ results are displayed only 20 per page, so I will also be looking for a clever way to iterate through all of them :)


import requests

url = "https://ndber.seai.ie/Pass/assessors/search.aspx"

# Form parameters copied verbatim from the browser's POST request
params = {
'__EVENTTARGET':"",
'__EVENTARGUMENT':"",
'__VIEWSTATE':"/wEPDwULLTEzODkxMTYxMTkPFgIeE1ZhbGlkYXRlUmVxdWVzdE1vZGUCARYCZg9kFgQCAQ9kFgICAg8VAVdodHRwczovL3d3dy5nb29nbGUuY29tL3JlY2FwdGNoYS9hcGkuanM/cmVuZGVyPTZMZGhSS1lVQUFBQUFDNUJ0WlB3eG9VMlNoMzFNTm5mVjQ4VTBVOVBkAgMPZBYCAgUPFgIeBWNsYXNzBRBtYWlud3JhcHBlcl93aWRlFgQCAw8PFgIeB1Zpc2libGVnZGQCCQ9kFgICAQ9kFgQCAQ9kFgJmD2QWAmYPZBYCAgMPZBYEAgIPZBYEZg9kFgICAQ9kFgYCAQ9kFgICAw8PFggeBF8hU0ICgAIeDERlZmF1bHRXaWR0aBweB1Rvb2xUaXAFMFBsZWFzZSBlbnRlciBhIHZhbHVlLCB3aXRoIG5vIHNwZWNpYWwgY2hhcmFjdGVycx4FV2lkdGgcZGQCAw9kFgICAw8PFggfAwKAAh8EHB8…RBc3Nlc3NvclNlYXJjaCRkZlNlYXJjaCRyYm5Eb21lc3RpYwU7Y3RsMDAkRGVmYXVsdENvbnRlbnQkQXNzZXNzb3JTZWFyY2gkZGZTZWFyY2gkcmJuTm9uRG9tZXN0aWMFO2N0bDAwJERlZmF1bHRDb250ZW50JEFzc2Vzc29yU2VhcmNoJGRmU2VhcmNoJHJibk5vbkRvbWVzdGljBT5jdGwwMCREZWZhdWx0Q29udGVudCRBc3Nlc3NvclNlYXJjaCRkZlNlYXJjaCRyYm5Ob25Eb21lc3RpY0RFQwU+Y3RsMDAkRGVmYXVsdENvbnRlbnQkQXNzZXNzb3JTZWFyY2gkZGZTZWFyY2gkcmJuTm9uRG9tZXN0aWNERUMFOmN0bDAwJERlZmF1bHRDb250ZW50JEFzc2Vzc29yU2VhcmNoJGdyaWRBc3Nlc3NvcnMkZ3JpZHZpZXcPZ2TpUoPOVLdrb5Z2O/NSxb3UXyuwZrgyqPCxRSKHOHnYMg==",
'__VIEWSTATEGENERATOR':"6C6E51AB",
'__SCROLLPOSITIONX':"0",
'__SCROLLPOSITIONY':"0",
'__EVENTVALIDATION':"/wEdAERKtx3/hBLIisMyUzU0gb3j0pNiMzuBxMlBbOZl58cjL/lCQKrAXJj4Xb/Exce8xi+xOvKaJ6/b1rTEwaFlgHJAE2XyeqNwAVjQZeSPgdeviwJw0fFtJVqRsSk/zHXDOJQtWTjfzlbENZrfGaW/s5Xm6UKt/ITihi09uczUBOJBEsVIDyhg2Ei3RnGYYRhxfdDiOnWXkFrnF9kusJDFkPS9rpNU/h0IGC12noA3ikLz5omFgTDYbAK32EA0KW1BJiH7AUw6ATYGabP0MqR3Jnw87IZK3/DdAj9JYkbpGuBplCqDShm0HH+Z69bSLFKtdJQpYQuvByz9yr0LBEC8C/+8NDodaKGk8iquSHdzOJr5cTabiIHJ9Ogb/WNkNcDNn9uvZS3jIBEii3/Pb6usmmkLhL4CFvYiwSh45QAMiAIqVp1HGbOdiKjXskShNeD/DQhu+Tf1LxB4hKkF7/kxuy/d8oEhczMZKLl26H4tTsRo61CO3QOJtjFctSe…sS6PcL5twwT6Qrwtiz1WYA4FEgXx/8VnbV8amTYPnlqT07jcD3nM7ixfcnefSJRvFRz9lwAFW9BDRupd7/UhfI4axjCFiJMXXn1zn9arpjbwLHf0NN52JRhlt6RnrGWg1T4R9mTco0dkgwPtDnMUkoI6BFWymlS4FdE/xPik2Wefr88SHsClh8EymeRKQMrzbwbOfaOn/RD6fDXlkXUcscWYB1uE8BXq735HuMo8GLvXT6l+e8ZyhmI9Lp9e7dMd7aERFTmGVLNY7yRYOSx9baIcM2fLZlEIm3bdGKmgH+IXssXXoTcUMaHgolEThw+hQm3PhCVtzzlRy51E8YLb0MYxXu9Jl+0OTCCP1mYBAyr7VE9Hvwu5zdYwg2U1yIA9b3L8AzUXiHD/1pZ6bMckGCkRMe+fR8D6doZfpoaP5HcOP9HafNogqDfKFcut5NbuB5ii4App8JRqdNwxT6eRha4BhsPxU6MQQukL12Z3solCReieIZ1T4J62Gxrkrs",
'ctl00$forgeryToken':"044e8950-e5de-44b8-8c96-f36c9ec9b826",
'ctl00$DefaultContent$AssessorSearch$captcha':"03AGdBq25L3Gaa37jV67YlYKWx3PTHujx1AC89JV4eA5vn_qV6hkmU3RXpjWktUBqZBJTRkxzgPDwatzCd0pbxzB8z9VFnhAvR7Dd068h6iJNO932JI98_hcb5fXU2Fq85YZHJ3LOhp_Ql7DSAX27DRFMwjRAPAad2AGe-xNU1wWh9yDPEpd9jViWJJffXQBwb-JTeDw5OKuYIgRv5-EBfs_-9mceFO32q2btFdNRKV-pVepzd5S3iWgwv3fy8BDQXEseKd1OVQ1Fs-J0VtyjkmTFZb5svozw1ftar423W0EP9JGIUvhqyjE8xzmftSIzHaB_S4-iypPbYWZq7gRzxA4yDlhDCHX1BdvU4fRJi0cthHmTPm4eyLjeMmD5s8x8ai3GTjfhFYztPOm1pHJzkbsJfHnIpBG43lWYagN2dvlNymWbD9tmSUcQ_04twdIoCaO_sYnm1jS1G",
'ctl00$DefaultContent$AssessorSearch$dfSearch$Name':"",
'ctl00$DefaultContent$AssessorSearch$dfSearch$CompanyName':"",
'ctl00$DefaultContent$AssessorSearch$dfSearch$County':"3",
'ctl00$DefaultContent$AssessorSearch$dfSearch$Areas$ctl02$0':"on",
'ctl00$DefaultContent$AssessorSearch$dfSearch$Areas$ctl02$1':"on",
'ctl00$DefaultContent$AssessorSearch$dfSearch$Areas$ctl02$2':"on",
'ctl00$DefaultContent$AssessorSearch$dfSearch$Areas$ctl02$3':"on",
'ctl00$DefaultContent$AssessorSearch$dfSearch$Areas$ctl02$4':"on",
'ctl00$DefaultContent$AssessorSearch$dfSearch$searchType':"rbnDomestic",
'ctl00$DefaultContent$AssessorSearch$dfSearch$Bottomsearch':"Search"
}

response = requests.post(url, data=params)
content = response.content
print(content)
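
Update: one thing I notice is that the hidden WebForms fields (__VIEWSTATE, __EVENTVALIDATION, ctl00$forgeryToken) look session-specific, so hard-coding values copied from an earlier browser session may be doomed from the start. A minimal sketch of reading them fresh instead (untested):

import requests
from bs4 import BeautifulSoup

url = "https://ndber.seai.ie/Pass/assessors/search.aspx"

# Read the current hidden-field values from a fresh GET instead of
# hard-coding ones copied from an earlier browser session
with requests.Session() as s:
    soup = BeautifulSoup(s.get(url).text, "lxml")
    hidden = {tag['name']: tag.get('value', '')
              for tag in soup.select("input[type=hidden][name]")}
    print(list(hidden))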

1 Answer:

Answer 0 (score: 1)

You can use the requests module to grab the table content that spans multiple pages on that site. To do so, you have to send several POST requests with the appropriate parameters to reach the content.

Unlike the other parameters, the key ctl00$DefaultContent$AssessorSearch$captcha has a value that is generated dynamically and does not appear in the page source.
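
You can verify this yourself: fetch the page without running any JavaScript and the captcha input carries no token (a quick check, assuming the input element is present in the static HTML):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://ndber.seai.ie/Pass/assessors/search.aspx')
soup = BeautifulSoup(r.text, 'lxml')
captcha = soup.select_one("input[name$='$AssessorSearch$captcha']")
# Before the reCAPTCHA JavaScript runs, there is no token in the value attribute
print(captcha)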

However, you can still fetch that key's value using the requests_html library. FYI, requests and requests_html are written by the same author. You only need to call get_captcha_value() once to obtain the captcha value; you can then reuse the same value until the end.

The script below currently fetches all the names from every page. You can modify the selectors to grab whichever other fields you are interested in, as in the sketch that follows.
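
For example, to collect every column of a result row rather than just the name, you could replace the body of the for loop in the script below with something like this (a sketch; the column layout is whatever the rendered table contains):

        # Grab the text of every cell in the current result row
        cells = [td.get_text(strip=True) for td in items.select("td")]
        print(cells)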

You can do it like this:

import requests
from bs4 import BeautifulSoup
from requests_html import HTMLSession

link = 'https://ndber.seai.ie/Pass/assessors/search.aspx'

payload = {
    'ctl00$DefaultContent$AssessorSearch$dfSearch$Name': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$CompanyName': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$County': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$searchType': 'rbnDomestic',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$Bottomsearch': 'Search'
}

page = 1

def get_captcha_value():
    # Render the page in a headless browser so the JavaScript that fills in
    # the captcha token can run, then read the token out of the input field
    with HTMLSession() as session:
        r = session.get(link)
        r.html.render(sleep=5)
        captcha_value = r.html.find("input[name$='$AssessorSearch$captcha']", first=True).attrs['value']
        return captcha_value

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    # Copy the session-specific hidden fields from the live page into the payload
    payload['__VIEWSTATE'] = soup.select_one("#__VIEWSTATE")['value']
    payload['__VIEWSTATEGENERATOR'] = soup.select_one("#__VIEWSTATEGENERATOR")['value']
    payload['__EVENTVALIDATION'] = soup.select_one("#__EVENTVALIDATION")['value']
    payload['ctl00$forgeryToken'] = soup.select_one("#ctl00_forgeryToken")['value']
    payload['ctl00$DefaultContent$AssessorSearch$captcha'] = get_captcha_value()
    
    while True:
        res = s.post(link, data=payload)
        soup = BeautifulSoup(res.text, "lxml")
        # Stop when a page comes back with no result rows
        if not soup.select_one("table[id$='gridAssessors_gridview'] tr[class$='RowStyle']"): break
        for items in soup.select("table[id$='gridAssessors_gridview'] tr[class$='RowStyle']"):
            _name = items.select_one("td > span").get_text(strip=True)
            print(_name)

        page += 1
        # Rebuild the payload from the hidden inputs of the page just received,
        # drop the two search-again buttons, and trigger the pager's postback
        payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
        payload.pop('ctl00$DefaultContent$AssessorSearch$dfSearchAgain$Feedback')
        payload.pop('ctl00$DefaultContent$AssessorSearch$dfSearchAgain$Search')
        payload['__EVENTTARGET'] = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'
        payload['__EVENTARGUMENT'] = f'1${page}'
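
The last two lines mimic what the page's own __doPostBack() JavaScript does when you click a pager link: the pager control's name goes into __EVENTTARGET and the requested page number into __EVENTARGUMENT. Isolated, the pair looks like this (field names taken from the page above; the '1$' prefix is what this particular grid's pager emits):

def pager_fields(page_number):
    # The hidden-field pair that a GridView pager click posts back
    return {
        '__EVENTTARGET': 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager',
        '__EVENTARGUMENT': f'1${page_number}',
    }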