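I am trying to scrape a website that generates its data with JavaScript. From what I have read here, the way to scrape such sites is:
So, when I do step 1, a POST request is sent to the link shown in this screenshot: You can also see the response it gets back. Looks great, right?
But when I try to recreate that request in Python, using the payload I see under the Post tab in Firebug, as follows: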
import requests
from bs4 import BeautifulSoup
payload = {"Max": 999, "RectCoord": "89,-179,-89,179", "Source": "", "SortField": "NEWID()",
           "OfficeName": "", "FirstName": "", "LastName": "da", "CityName": "", "ZipCode": "",
           "Category": "S", "SecLanguageReq": "", "OfficeCode": ""}
r = requests.post('http://search.cnyrealtor.com/MyAjaxService.asmx/MemberSearch', data=payload)
print(r.content)
I get a page showing this error message instead:
Request format is unrecognized for URL unexpectedly ending in \'/MemberSearch\'
So my question is: why do I get this error response when the response in Firebug is fine? Am I missing something in the requests.post(url) line of my Python script?
Answer 0 (score: 1)
You need to dump the dictionary to JSON and send that string as the payload. Setting the Content-Type request header is also important:
import json
import requests
payload = {"Max": 999, "RectCoord": "89,-179,-89,179", "Source": "", "SortField": "NEWID()", "OfficeName": "",
           "FirstName": "", "LastName": "", "CityName": "", "ZipCode": "", "Category": "S", "SecLanguageReq": "",
           "OfficeCode": ""}
with requests.Session() as session:
    # Visit the search page first so the session picks up any cookies it sets
    session.get("http://search.cnyrealtor.com/SiteContent/SYR/MemberSearchSYR.aspx")
    # Send the payload as a JSON string and declare it as JSON
    r = session.post('http://search.cnyrealtor.com/MyAjaxService.asmx/MemberSearch', data=json.dumps(payload),
                     headers={"Content-Type": "application/json; charset=UTF-8"})
    print(r.content)
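To see why the original attempt most likely failed, it can help to compare the two request bodies side by side. The sketch below uses only the standard library and sends nothing over the network. It assumes that passing a dict as data= to requests produces a form-encoded body of roughly the shape that urlencode gives, while json.dumps produces the JSON body the service appears to expect. The helper name build_bodies and the trimmed payload are made up for illustration.

```python
import json
from urllib.parse import urlencode

# Trimmed version of the payload from the question, for illustration only.
payload = {"Max": 999, "RectCoord": "89,-179,-89,179", "LastName": "da", "Category": "S"}


def build_bodies(data):
    """Return (form_body, json_body) for the same dictionary.

    build_bodies is a hypothetical helper, not part of requests.
    """
    form_body = urlencode(data)   # key=value&key=value, the shape data=<dict> produces
    json_body = json.dumps(data)  # a JSON object, as sent with data=json.dumps(...)
    return form_body, json_body


form_body, json_body = build_bodies(payload)
print("form:", form_body)
print("json:", json_body)
```

The first line printed is a key=value string joined with &, the second is a JSON object. If I remember correctly, recent versions of requests also accept json=payload, which serializes the dictionary and sets the Content-Type header to application/json for you, so that may be a shorter way to write the same request.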