I'm trying to retrieve all the results from the following website using the approach below:
import requests
import scrapy


class MyPropertySpider(scrapy.Spider):
    name = 'my_property'
    start_urls = [
        'https://www.myproperty.co.za/search?last=1y&coords%5Blat%5D=-33.2277918&coords%5Blng%5D=21.8568586&coords%5Bnw%5D%5Blat%5D=-30.4302599&coords%5Bnw%5D%5Blng%5D=17.7575637&coords%5Bse%5D%5Blat%5D=-47.1313489&coords%5Bse%5D%5Blng%5D=38.2216904&description=Western%20Cape%2C%20South%20Africa&status=For%20Sale',
    ]

    def parse(self, response):
        # Headers copied from the browser's request to the search API.
        headers = {
            'authority': 'jf6e1ij07f.execute-api.eu-west-1.amazonaws.com',
            'pragma': 'no-cache',
            'cache-control': 'no-cache',
            'accept': 'application/json, text/plain, */*',
            'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Mobile Safari/537.36',
            'content-type': 'application/json;charset=UTF-8',
            'origin': 'https://www.myproperty.co.za',
            'sec-fetch-site': 'cross-site',
            'sec-fetch-mode': 'cors',
            'sec-fetch-dest': 'empty',
            'referer': 'https://www.myproperty.co.za/',
            'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
        }
        # JSON payload for the search API; note the "limit" and "start" fields.
        data = '{"clientOfficeId":[],"countryCode":"za","sortField":"distance","sortOrder":"asc","last":"0.5y","statuses":["For Sale","Pending Sale","Under Offer","Final Sale","Auction"],"coords":{"lat":"-33.9248685","lng":"18.4240553","nw":{"lat":"-33.47127","lng":"18.3074488"},"se":{"lat":"-34.3598061","lng":"19.00467"}},"radius":2500,"nearbySuburbs":true,"limit":210,"start":0}'
        response = requests.post('https://jf6e1ij07f.execute-api.eu-west-1.amazonaws.com/p/search', headers=headers,
                                 data=data)
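However, I only get 200 results from this page, even though the search page lists more than 1,000. I can see that the request data sets the limit to 210, and when I try to increase it, nothing changes. I'm not sure how (or whether) this can be solved. Any suggestions? Thanks in advance!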
Answer 0 (score: 1)
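Since you're using Scrapy, I'd suggest using FormRequest instead of the requests library; both can send the same POST request. (Because this particular payload is sent as JSON, scrapy.http.JsonRequest may be a closer fit than a form-encoded request.) See the Scrapy docs on FormRequest if you want to read more about this approach.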
This is the form data you are passing; it gives the server all the search parameters you're interested in.
data = {
    "clientOfficeId": [],
    "countryCode": "za",
    "sortField": "distance",
    "sortOrder": "asc",
    "last": "0.5y",
    "statuses": ["For Sale", "Pending Sale", "Under Offer", "Final Sale", "Auction"],
    "coords": {
        "lat": "-33.9248685",
        "lng": "18.4240553",
        "nw": {"lat": "-33.47127", "lng": "18.3074488"},
        "se": {"lat": "-34.3598061", "lng": "19.00467"},
    },
    "radius": 2500,
    "nearbySuburbs": True,
    "limit": 210,
    "start": 0,
}
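Note that the dict uses Python's True where the original string had JSON's true. As a minimal sketch (using only a shortened subset of the fields for brevity), this is how such a dict can be turned back into the JSON text the API receives:

```python
import json

# Shortened subset of the search parameters, for illustration only.
data = {
    "clientOfficeId": [],
    "countryCode": "za",
    "nearbySuburbs": True,  # Python bool; serialized as JSON true
    "limit": 210,
    "start": 0,
}

# json.dumps converts Python values (True, lists, nested dicts) to JSON text.
body = json.dumps(data)
print(body)
```

The resulting string can be sent as the body of the POST request, together with the content-type: application/json header from the question. Keeping the parameters as a dict makes it easy to change "start" between requests.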
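Since the server won't give you all the data at once (I haven't tested this, but you said that raising the limit doesn't change the results), it expects you to "paginate" through the data, just as the website does.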
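When you send the form above, it returns 210 results, so on the next call you need to tell the server that you want the NEXT 210 results rather than the ones you have already received. To do that, use the start field in the form. In the next request, use "start": 210, and keep incrementing it until the server starts returning empty responses. (Usually the response isn't completely empty; instead, the results field comes back empty.)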
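The paging loop described above could be sketched roughly like this. It is not tested against the real API: fetch_page is a hypothetical callable standing in for whatever sends the POST and decodes the JSON reply, and the response field name "results" is an assumption, so check the actual response for the real key.

```python
def fetch_all(fetch_page, base_payload, page_size=210,
              results_key="results", max_pages=100):
    """Collect results page by page until an empty page comes back.

    fetch_page:  hypothetical callable; takes a payload dict and returns
                 the decoded JSON response as a dict.
    results_key: ASSUMPTION, the name of the list field in the response.
    """
    collected = []
    start = 0
    for _ in range(max_pages):  # safety cap so the loop always terminates
        # Copy the search parameters and set the paging fields for this call.
        payload = {**base_payload, "limit": page_size, "start": start}
        page = fetch_page(payload).get(results_key) or []
        if not page:  # empty results field: no more data
            break
        collected.extend(page)
        # Advance by what was actually received; this equals page_size
        # whenever the server honours the requested limit.
        start += len(page)
    return collected


if __name__ == "__main__":
    # Stand-in for the real request: serves slices of 500 fake listings.
    listings = [{"id": i} for i in range(500)]

    def fake_fetch(payload):
        s, n = payload["start"], payload["limit"]
        return {"results": listings[s:s + n]}

    print(len(fetch_all(fake_fetch, {"countryCode": "za"})))
```

In a Scrapy spider the same idea would live in the callback: parse the JSON response, yield the items, and, if the page was not empty, yield the next POST request with the increased "start" value.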