Unable to scrape data across multiple pages

Time: 2017-12-13 15:35:51

Tags: python python-3.x web-scraping http-post

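I wrote a script in Python to fetch data from a web page. The site displays its content across 60 pages. My scraper can parse data from the second page, but when I change the page number in the payload or wrap the request in a loop to fetch several pages, it breaks immediately. How can I fix my script so that it fetches data from all pages rather than only the second one? Thanks in advance.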

  1. Link to reach the site with the data: Page_link
  2. Link to substitute into the script below: page_url
  3. I think the pagination number is set here:

    ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages:1
    

This is the complete script (works only for page 2):

    import requests
    from bs4 import BeautifulSoup
    
    url = "Link to replace with the above url"  # replace with link number 2 above
    
    formdata = {
        'searchEntity':'FundServiceProvider',
        'searchType':'Name',
        'searchText':'',
        'registers':'6,29,44,45',
        'AspxAutoDetectCookieSupport':'1'
    }
    req = requests.get(url,params=formdata,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(req.text,"lxml")
    
    VIEWSTATE = soup.select("#__VIEWSTATE")[0]['value']
    EVENTVALIDATION = soup.select("#__EVENTVALIDATION")[0]['value']
    
    payload = {
        '__EVENTTARGET':'',
        '__EVENTARGUMENT':'',
        '__LASTFOCUS':'',
        '__VIEWSTATE':VIEWSTATE,
        '__SCROLLPOSITIONX':'0',
        '__SCROLLPOSITIONY':'541',
        '__EVENTVALIDATION':EVENTVALIDATION,
        'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages':1,
        'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.x':'260',
        'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.y':'11'
    }
    
    with requests.session() as session:
        session.headers = {"User-Agent":"Mozilla/5.0"}
        response = session.post(req.url,data=payload)
        soup = BeautifulSoup(response.text,"lxml")
        tabd = soup.select(".searchresults")[0]
        for items in tabd.select("tr")[:-1]:
            data = ' '.join([item.text for item in items.select("th,td")])
            print(data)
    

1 answer:

Answer 0 (score: 1)

You only need to remove the last two fields (the btnNext.x and btnNext.y entries) from the payload:

    payload = {
        '__EVENTTARGET':'',
        '__EVENTARGUMENT':'',
        '__LASTFOCUS':'',
        '__VIEWSTATE':VIEWSTATE,
        '__SCROLLPOSITIONX':'0',
        '__SCROLLPOSITIONY':'541',
        '__EVENTVALIDATION':EVENTVALIDATION,
        'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages':1
    }

instead of

    payload = {
        '__EVENTTARGET':'',
        '__EVENTARGUMENT':'',
        '__LASTFOCUS':'',
        '__VIEWSTATE':VIEWSTATE,
        '__SCROLLPOSITIONX':'0',
        '__SCROLLPOSITIONY':'541',
        '__EVENTVALIDATION':EVENTVALIDATION,
        'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages':1,
        'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.x':'260',
        'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.y':'11'
    }

Then changing the value of ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages to the page number you want will return that page's data.
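
To make the loop concrete, here is a minimal sketch built on the trimmed payload. It is untested against the real site and rests on a few assumptions: that the ddlPages value selects the page returned (as described above), that every response carries fresh __VIEWSTATE and __EVENTVALIDATION fields which should be echoed back on the next POST (typical for ASP.NET WebForms pages, but not confirmed here), and that the field name keeps its ctl18 segment on every page. The helper names hidden_fields, build_payload, table_rows and scrape are made up for this example.

```python
import requests
from bs4 import BeautifulSoup

# Pagination field taken from the question; assumed stable across pages.
PAGE_FIELD = 'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages'


def hidden_fields(soup):
    """Read the state fields from the most recent response."""
    return {
        '__VIEWSTATE': soup.select_one('#__VIEWSTATE')['value'],
        '__EVENTVALIDATION': soup.select_one('#__EVENTVALIDATION')['value'],
    }


def build_payload(state, page):
    """Trimmed payload from the answer: no btnNext.x / btnNext.y entries."""
    payload = {
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__LASTFOCUS': '',
        '__SCROLLPOSITIONX': '0',
        '__SCROLLPOSITIONY': '541',
        PAGE_FIELD: page,
    }
    payload.update(state)
    return payload


def table_rows(soup):
    """Join the cells of each result row, skipping the final pager row."""
    table = soup.select('.searchresults')[0]
    return [' '.join(cell.text for cell in row.select('th,td'))
            for row in table.select('tr')[:-1]]


def scrape(url, params, last_page):
    """Yield the rows of pages 1..last_page, one POST per page."""
    with requests.Session() as session:
        session.headers.update({'User-Agent': 'Mozilla/5.0'})
        response = session.get(url, params=params)
        search_url = response.url
        soup = BeautifulSoup(response.text, 'lxml')
        for page in range(1, last_page + 1):
            # Assumption: state fields must come from the latest response.
            payload = build_payload(hidden_fields(soup), page)
            response = session.post(search_url, data=payload)
            soup = BeautifulSoup(response.text, 'lxml')
            yield from table_rows(soup)
```

Iterating over scrape(url, formdata, 60), with url and formdata as defined in the question, and printing each item should reproduce the question's output for every page. Page 1 is fetched by POST as well, which costs one extra request but keeps the loop uniform.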