Web scraping a page that requires a cookie

Date: 2020-03-18 17:25:48

Tags: python web-scraping python-requests

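I am trying to scrape the results of this booking website. The site sets a cookie to identify the session. I tried to replicate this with requests, but I still get an Invalid Session ID error. What am I doing wrong?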

import requests

url = 'https://alilauro-tickets.certusonline.com/php/proxy.php'
s = requests.Session()
s.get(url)
data = {
    'msg': 'TimeTable',
    'req': '{"getAvailability":"Y","getBasicPrice":"Y","getRouteAnalysis":"Y","directOnly":"Y","legs":1,"pax":1,"origin":"BEV","destination":"FOR","tripRequest":[{"tripfrom":"BEV","tripto":"FOR","tripdate":"2020-03-21","tripleg":0}]}'
}
r = s.post(url, data=data, cookies=s.cookies)

This is the error I get:

'sessionID': none, 'errorCode': '620', 'errorDescription': 'Invalid Session Number'

Here is the cookie information: (image: Cookie information)

1 answer:

Answer 0: (score: 0)

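Indeed, a cookie is set when you call https://alilauro-tickets.certusonline.com/php/proxy.php, but that cookie is not valid until the JavaScript function calls https://alilauro-tickets.certusonline.com/php/proxy.php?msg=Connect. As Dan-Dev mentioned in the comments, this is a protection against CSRF.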

The following approach works:

import requests
import json

url = "https://alilauro-tickets.certusonline.com/php/proxy.php"

session = requests.Session()

# Send the Connect message first: the session cookie is not valid until this call is made.
r = session.post(url, data= { "msg": "Connect"})
r = session.post(url, data= { 
    "msg": "TimeTable", 
    "req": json.dumps({
        "getAvailability":"Y",
        "getBasicPrice":"Y",
        "getRouteAnalysis":"Y",
        "directOnly":"Y",
        "legs":"1",
        "pax":1,
        "origin":"FOR",
        "destination":"BEV",
        "tripRequest":[{
            "tripfrom":"FOR",
            "tripto":"BEV",
            "tripdate":"2020-03-20",
            "tripleg":0
        }]
    })
})

print(json.loads(r.text)["VWS_Trips_Trip"])
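
If you want the Connect-then-TimeTable sequence in one reusable place, something along these lines might help. It is only a sketch: `build_timetable_payload`, `raise_on_error`, `fetch_timetable` and `BookingError` are hypothetical names, not part of the site or of requests. It also assumes that failures are reported through a non-empty `errorCode` field, as in the error shown in the question; what a successful response carries in that field is not shown here, so the check may need adjusting.

```python
import json

URL = "https://alilauro-tickets.certusonline.com/php/proxy.php"


class BookingError(RuntimeError):
    """Hypothetical exception for responses that carry an errorCode."""


def build_timetable_payload(origin, destination, date):
    """Build the form fields for a TimeTable request, using the same fields as above."""
    req = {
        "getAvailability": "Y",
        "getBasicPrice": "Y",
        "getRouteAnalysis": "Y",
        "directOnly": "Y",
        "legs": "1",
        "pax": 1,
        "origin": origin,
        "destination": destination,
        "tripRequest": [
            {"tripfrom": origin, "tripto": destination, "tripdate": date, "tripleg": 0}
        ],
    }
    # The site expects the request itself as a JSON string in the "req" form field.
    return {"msg": "TimeTable", "req": json.dumps(req)}


def raise_on_error(body):
    """Raise BookingError if the decoded response looks like an error.

    Assumption: errors are signalled by a non-empty 'errorCode' key,
    as in the 620 'Invalid Session Number' response from the question.
    """
    if isinstance(body, dict) and body.get("errorCode"):
        raise BookingError(f"{body['errorCode']}: {body.get('errorDescription')}")
    return body


def fetch_timetable(session, origin, destination, date):
    """Validate the session cookie with Connect, then request the timetable.

    `session` is expected to be a requests.Session (or anything with a
    compatible post() method).
    """
    # Step 1: Connect, so the cookie held by the session becomes valid.
    session.post(URL, data={"msg": "Connect"}, timeout=30).raise_for_status()
    # Step 2: the actual query, sent on the same session so the cookie is reused.
    r = session.post(URL, data=build_timetable_payload(origin, destination, date), timeout=30)
    r.raise_for_status()
    return raise_on_error(r.json())
```

With requests installed, it could be called as `fetch_timetable(requests.Session(), "FOR", "BEV", "2020-03-20")["VWS_Trips_Trip"]`. Passing the session in keeps both posts on the same cookie jar, which is the point of the fix.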