Scraping an ASP.NET site that uses an ID in the URL with Python

Asked: 2016-01-01 17:32:35

Tags: python asp.net python-3.x web-scraping python-requests

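I am trying to use Python requests to send a POST request that retrieves search results from this ASP.NET site. Even though I first make a GET request to obtain the request verification token and include it in my headers, I only get a response like this: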

{"Token":"Y2VgsmEAAwA","Link":"/search/Y2VgsmEAAwA/"}

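This is not a valid link. It leads to the total search results, without the arrival date or area that I defined in my POST request. What am I missing? How do I scrape a site like this, which generates a (session?) ID for the URL?

Thank you all very much!

My Python script: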

import json
import requests
from bs4 import BeautifulSoup

r = requests.Session()

# GET request  
gr = r.get("http://www.feline.dk")
bsObj = BeautifulSoup(gr.text,"html.parser")
auth_string = bsObj.find("input", {"name": "__RequestVerificationToken"})['value']
#print(auth_string)
#print(gr.url)

# POST request
search_request = {
    "Geography.Geography":"Danmark",
    "Geography.GeographyLong=":"Danmark (Ferieområde)",
    "Geography.Id":"da509992-0830-44bd-869d-0270ba74ff62",
    "Geography.SuggestionId": "",
    "Period.Arrival":"16-1-2016",
    "Period.Duration":7,
    "Period.ArrivalCorrection":"false",
    "Price.MinPrice":None,
    "Price.MaxPrice":None,
    "Price.MinDiscountPercentage":None,
    "Accommodation.MinPersonNumber":None,
    "Accommodation.MinBedrooms":None,
    "Accommodation.NumberOfPets":None,
    "Accommodation.MaxDistanceWater":None,
    "Accommodation.MaxDistanceShopping":None,
    "Facilities.SwimmingPool":"false",
    "Facilities.Whirlpool":"false",
    "Facilities.Sauna":"false",
    "Facilities.InternetAccess":"false",
    "Facilities.SatelliteCableTV":"false",
    "Facilities.FireplaceStove":"false",
    "Facilities.Dishwasher":"false",
    "Facilities.WashingMachine":"false",
    "Facilities.TumblerDryer":"false",
    "update":"true"
    }


payload = { 
    "searchRequestJson": json.dumps(search_request),
    }


header ={
"Accept":"application/json, text/html, */*; q=0.01",
"Accept-Encoding":"gzip, deflate",
"Accept-Language":"da-DK,da;q=0.8,en-US;q=0.6,en;q=0.4",
"Connection":"keep-alive",
"Content-Length":"720",
"Content-Type":"application/x-www-form-urlencoded; charset=UTF-8",
"Cookie":"ASP.NET_SessionId=ebkmy3bzorzm2145iwj3bxnq; __RequestVerificationToken=" + auth_string + "; aid=382a95aab250435192664e80f4d44e0f; cid=google-dk; popout=hidden; __utmt=1; __utma=1.637664197.1451565630.1451638089.1451643956.3; __utmb=1.7.10.1451643956; __utmc=1; __utmz=1.1451565630.1.1.utmgclid=CMWOra2PhsoCFQkMcwod4KALDQ|utmccn=(not%20set)|utmcmd=(not%20set)|utmctr=(not%20provided); BNI_Feline.Web.FelineHolidays=0000000000000000000000009b84f30a00000000",
"Host":"www.feline.dk",
"Origin":"http://www.feline.dk",
#"Referer":"http://www.feline.dk/search/Y2WZNDPglgHHXpe2uUwFu0r-JzExMYi6yif5KNswMDBwMDAAAA/",
"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
"X-Requested-With":"XMLHttpRequest"
 }

gr = r.post(
    url = 'http://www.feline.dk/search',
    data = payload,
    headers = header
    )

#print(gr.url)
bsObj = BeautifulSoup(gr.text,"html.parser")
print(bsObj)

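2 Answers:

Answer 0 (score: 1)

After several attempts, I found that your search request is in the wrong format (it needs to be URL-encoded, not JSON) and that the cookie information set in your headers overwrites the session's cookies (just let the session do the work).

I simplified the code and got the desired result: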

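import requests
from bs4 import BeautifulSoup

# The session stores the cookies from the GET and sends them with the POST
r = requests.Session()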

# GET request  
gr = r.get("http://www.feline.dk")
bsObj = BeautifulSoup(gr.text,"html.parser")
auth_string = bsObj.find("input", {"name": "__RequestVerificationToken"})['value']

# POST request
search_request = "Geography.Geography=Hou&Geography.GeographyLong=Hou%2C+Danmark+(Ferieomr%C3%A5de)&Geography.Id=847fcbc5-0795-4396-9318-01e638f3b0f6&Geography.SuggestionId=&Period.Arrival=&Period.Duration=7&Period.ArrivalCorrection=False&Price.MinPrice=&Price.MaxPrice=&Price.MinDiscountPercentage=&Accommodation.MinPersonNumber=&Accommodation.MinBedrooms=&Accommodation.NumberOfPets=&Accommodation.MaxDistanceWater=&Accommodation.MaxDistanceShopping=&Facilities.SwimmingPool=false&Facilities.Whirlpool=false&Facilities.Sauna=false&Facilities.InternetAccess=false&Facilities.SatelliteCableTV=false&Facilities.FireplaceStove=false&Facilities.Dishwasher=false&Facilities.WashingMachine=false&Facilities.TumblerDryer=false"

gr = r.post(
    url = 'http://www.feline.dk/search/',
    data = search_request,
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
)

print(gr.url)

Result:

http://www.feline.dk/search/Y2U5erq-ZSr7NOfJEozPLD5v-MZkw8DAwMHAAAA/
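The hand-written query string above can also be built from a dictionary, which is easier to edit. The following is a minimal sketch using only the standard library; the helper name `build_search_body` is hypothetical, and only a few of the form fields are shown. Empty values are passed as empty strings, because `urlencode` would otherwise send the literal text `None`.

```python
from urllib.parse import urlencode, parse_qs


def build_search_body(fields):
    """Return the fields as an application/x-www-form-urlencoded string."""
    return urlencode(fields)


# A subset of the form fields from the answer above (values copied from it).
search_fields = {
    "Geography.Geography": "Hou",
    "Geography.GeographyLong": "Hou, Danmark (Ferieområde)",
    "Geography.Id": "847fcbc5-0795-4396-9318-01e638f3b0f6",
    "Period.Arrival": "",  # empty string, not None
    "Period.Duration": 7,
    "Facilities.SwimmingPool": "false",
}

body = build_search_body(search_fields)

# Decoding the body again gives back the original values.
decoded = parse_qs(body, keep_blank_values=True)
print(body)
```

With requests, passing the dictionary itself as `data=` to `Session.post` form-encodes it in the same way, so the explicit `Content-Type` header is then set by the library. Note that `urlencode` also percent-encodes the parentheses, which decodes to the same value as the string in the answer.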

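Answer 1 (score: 0)

Thanks to Kantium for the answer. In my case, I found that the RequestVerificationToken was actually generated in a JS script inside the page.

1- Call the first page, which generates the code. In my case, it returned something like this in the HTML: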

<script>
    Sys.Net.WebRequestManager.add_invokingRequest(function (sender, networkRequestEventArgs) {
        var request = networkRequestEventArgs.get_webRequest();
        var headers = request.get_headers();
        headers['RequestVerificationToken'] = '546bd932b91b4cdba97335574a263e47';
    });
  
    $.ajaxSetup({
        beforeSend: function (xhr) {
            xhr.setRequestHeader("RequestVerificationToken", '546bd932b91b4cdba97335574a263e47');
        },
        complete: function (result) {
            console.log(result);
        },
    });

</script>

2- Grab the RequestVerificationToken code, then add it to your request together with the cookie from the set-cookie header.

 let resp_setcookie = response.headers["set-cookie"];
 let rege = new RegExp(/(?:RequestVerificationToken", ')(\S*)'/);
 let token = rege.exec(response.body)[1];

I actually store them in global variables, then add them to the request object in my Node.js Request:

headers.Cookie = gCookies.cookie;
headers.RequestVerificationToken = gCookies.token;
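Since the question uses Python, the same extraction can be sketched there as well. This is only an illustration under assumptions: the function name `extract_token` and the sample HTML are made up for the example, and the pattern targets the `setRequestHeader("RequestVerificationToken", '...')` call shown above, like the JavaScript regex does.

```python
import re

# Matches: "RequestVerificationToken", '<token>' (either quote style).
TOKEN_PATTERN = re.compile(
    r"""RequestVerificationToken["']\s*,\s*["']([^"']+)["']"""
)


def extract_token(html):
    """Return the token embedded in the page script, or None if absent."""
    match = TOKEN_PATTERN.search(html)
    return match.group(1) if match else None


# Sample page fragment, modelled on the script block shown earlier.
sample_html = """
<script>
    $.ajaxSetup({
        beforeSend: function (xhr) {
            xhr.setRequestHeader("RequestVerificationToken", '546bd932b91b4cdba97335574a263e47');
        }
    });
</script>
"""

token = extract_token(sample_html)
# Header to send with the follow-up request.
headers = {"RequestVerificationToken": token}
print(headers)
```

With a `requests.Session`, the cookies from the first response are kept automatically, so only the token header would need to be added, for example with `session.headers.update(headers)`.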

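This way, the final request carries both the Cookie header and the RequestVerificationToken header. (The original answer showed a screenshot of the request here.)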

Remember that you can monitor the requests being sent with:

require("request-debug")(requestpromise);
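For the Python requests library used in the question, a rough equivalent is to turn on the debug output of the standard library's `http.client`, which requests relies on through urllib3. This is a sketch, not an exact counterpart of request-debug; the function name `enable_http_debug` is hypothetical.

```python
import http.client
import logging


def enable_http_debug():
    """Print outgoing request lines and headers, and log urllib3 activity."""
    # http.client prints each request and response header to stdout.
    http.client.HTTPConnection.debuglevel = 1
    # urllib3 reports connection events through the logging module.
    logging.basicConfig(level=logging.DEBUG)
    logging.getLogger("urllib3").setLevel(logging.DEBUG)


enable_http_debug()
```

Call it once before making any requests. Every later request in the process then prints its headers, which makes it easy to check whether the cookie and token are actually being sent.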

Good luck!