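I'm trying to scrape Jaap.nl, but I'm running into some difficulties. For example, when you search for the city of Amsterdam, you are redirected to a URL that contains more than just "amsterdam".
base_url: https://www.jaap.nl/koophuizen/ > https://www.jaap.nl/koophuizen/noord+holland/groot-amsterdam/amsterdam
I want to capture the extra part (noord+holland/groot-amsterdam/amsterdam). I can see that before the GET for that page there is a POST request whose response carries the extended URL in its Location header, but I can't capture that piece in my code. See the code below: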
import json

import requests


def post_page(type="koophuizen", city="amsterdam"):
    url = f"https://www.jaap.nl/{type}"
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
               "Content-Type": "application/x-www-form-urlencoded"}
    payload = {"action": "searchExtensive",
               "url": f"/{type}",
               "search_input_extensive": city}
    response = requests.post(url, data=json.dumps(payload), headers=headers)
    print(response.headers)


post_page()
I get the following response headers:
{'Cache-Control': 'private',
'Content-Type': 'text/html; charset=utf-8',
'Content-Encoding': 'gzip',
'Vary': 'Accept-Encoding',
'Server': 'Microsoft-IIS/8.5',
'Set-Cookie': 'SESSIONToken=7f8c65d3-7962-41a8-9604-a996957fd0ad; expires=Tue, 20-Nov-2029 23:11:36 GMT; path=/; HttpOnly, lastcity=76; path=/',
'X-AspNetMvc-Version': '4.0',
'X-AspNet-Version': '4.0.30319',
'X-Powered-By': 'ASP.NET, ARR/3.0, ASP.NET',
'strict-transport-security': 'max-age=31536000; includeSubdomains',
'X-Handled-By': 'TORNADO',
'X-Jaap-Router': 'Routed',
'X-Frame-Options': 'SAMEORIGIN',
'Date': 'Wed, 20 Nov 2019 23:11:36 GMT',
'Content-Length': '32956'}
What I'm looking for is:
"Location": "/koophuizen/noord+holland/groot-amsterdam/amsterdam"
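which is what I see when I inspect the response headers of the POST request in the browser.
I keep getting 200 as the response code, while I'm looking for a 302. Even with allow_redirects=False and a Session to keep the cookies, I can't get it to work.
Can someone tell me what I'm doing wrong?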
Answer 0 (score: 1)
This works for me:
import requests

city_to_search = input("Insert your city: ")

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'it-IT,it;q=0.8,en-US;q=0.5,en;q=0.3',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'https://www.jaap.nl',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Referer': 'https://www.jaap.nl/koophuizen/',
    'Upgrade-Insecure-Requests': '1',
}

data = {
    'action': 'searchExtensive',
    'url': '/koophuizen',
    'search_input_extensive': city_to_search
}

response = requests.post('https://www.jaap.nl/', headers=headers, data=data)
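
One visible difference from the question's code is the request body: the question passes data=json.dumps(payload), a JSON string, while declaring a form-urlencoded Content-Type, whereas the code above passes the dict itself, which requests form-encodes. Whether that alone explains the missing redirect is an assumption I have not verified against the site. The sketch below shows the two encodings side by side and one way to read the Location header. The helper names (build_payload, encoded_bodies, resolve_location, find_search_url) are made up for illustration; the field names and URLs come from this thread.

```python
import json
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.jaap.nl/"


def build_payload(city, listing_type="koophuizen"):
    """Form fields for the search POST (field names taken from this thread)."""
    return {
        "action": "searchExtensive",
        "url": f"/{listing_type}",
        "search_input_extensive": city,
    }


def encoded_bodies(payload):
    """Return (form_body, json_body) to compare the two request bodies.

    A dict passed as data= is form-encoded by requests, much like urlencode
    does here. A str passed as data= (e.g. json.dumps(payload)) is sent as-is.
    """
    return urlencode(payload), json.dumps(payload)


def resolve_location(location, base_url=BASE_URL):
    """Turn a relative Location header value into an absolute URL."""
    return urljoin(base_url, location)


def find_search_url(city, listing_type="koophuizen"):
    """POST the search form and return the redirect target, or None.

    Assumption: the site answers the form POST with a redirect, as the
    question reports seeing in the browser. This is not verified here.
    """
    import requests  # imported here so the helpers above work without it

    response = requests.post(
        BASE_URL,
        data=build_payload(city, listing_type),  # dict, so form-encoded
        headers={"User-Agent": "Mozilla/5.0"},
        allow_redirects=False,  # keep the 3xx response instead of following it
        timeout=10,
    )
    location = response.headers.get("Location")
    return resolve_location(location) if location else None
```

Calling find_search_url("amsterdam") would return the absolute redirect target if the site responds with a Location header, and None otherwise. If redirects are followed instead (the default, as in the code above), response.url holds the final URL and response.history holds the intermediate redirect responses.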