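Unable to scrape all results from a search page

Time: 2020-10-19 23:34:48

Tags: javascript json web-scraping scrapy web-crawler

I am trying to retrieve all the results from the following website using the approach below: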

import requests
import scrapy


class MyPropertySpider(scrapy.Spider):
    name = 'my_property'
    start_urls = [
        'https://www.myproperty.co.za/search?last=1y&coords%5Blat%5D=-33.2277918&coords%5Blng%5D=21.8568586&coords%5Bnw%5D%5Blat%5D=-30.4302599&coords%5Bnw%5D%5Blng%5D=17.7575637&coords%5Bse%5D%5Blat%5D=-47.1313489&coords%5Bse%5D%5Blng%5D=38.2216904&description=Western%20Cape%2C%20South%20Africa&status=For%20Sale',
    ]

    def parse(self, response):
        headers = {
            'authority': 'jf6e1ij07f.execute-api.eu-west-1.amazonaws.com',
            'pragma': 'no-cache',
            'cache-control': 'no-cache',
            'accept': 'application/json, text/plain, */*',
            'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Mobile Safari/537.36',
            'content-type': 'application/json;charset=UTF-8',
            'origin': 'https://www.myproperty.co.za',
            'sec-fetch-site': 'cross-site',
            'sec-fetch-mode': 'cors',
            'sec-fetch-dest': 'empty',
            'referer': 'https://www.myproperty.co.za/',
            'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
        }

        data = '{"clientOfficeId":[],"countryCode":"za","sortField":"distance","sortOrder":"asc","last":"0.5y","statuses":["For Sale","Pending Sale","Under Offer","Final Sale","Auction"],"coords":{"lat":"-33.9248685","lng":"18.4240553","nw":{"lat":"-33.47127","lng":"18.3074488"},"se":{"lat":"-34.3598061","lng":"19.00467"}},"radius":2500,"nearbySuburbs":true,"limit":210,"start":0}'

        response = requests.post('https://jf6e1ij07f.execute-api.eu-west-1.amazonaws.com/p/search', headers=headers,
                                 data=data)

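However, I only get 200 results from this page, even though the search page lists more than 1,000. I can see that the limit in the request data is 210, and when I try to increase it, nothing changes. I am not sure how (or whether it is even possible) to get around this. Any suggestions? Thanks in advance!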

1 Answer:

Answer 0 (score: 1)

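Since you are using Scrapy, I suggest using FormRequest instead of the requests library. Both can send the same POST request, and the Scrapy documentation covers FormRequest if you want to read up on this approach. One caveat: FormRequest URL-encodes its formdata, while this endpoint is called with a JSON content-type, so the payload would likely need to be passed as a JSON string in the request body, or sent with Scrapy's JsonRequest.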


This is the form data you are passing; it tells the server all the search parameters you are interested in.

data = {
    "clientOfficeId": [],
    "countryCode":"za",
    "sortField":"distance",
    "sortOrder":"asc",
    "last":"0.5y",
    "statuses":["For Sale","Pending Sale","Under Offer","Final Sale","Auction"],
    "coords": {
        "lat": "-33.9248685",
        "lng": "18.4240553",
        "nw": {"lat": "-33.47127", "lng": "18.3074488"},
        "se": {"lat": "-34.3598061", "lng": "19.00467"},
    },
    "radius":2500,
    "nearbySuburbs":True,
    "limit":210,
    "start":0
}

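Since the server will not give you all the data at once (I have not tested this, but you said that raising the limit does not change the result), it expects you to paginate through the data, just as the website does.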

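When you send the form above, it returns 210 results, so on the next call you need to tell the server that you want the NEXT 210 results rather than the ones you already received. To do that, use the start field in the form. In the next request, use "start": 210, and keep increasing it until the server starts returning empty responses. (Usually the response is not completely empty; rather, the results field comes back empty.)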
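
The pagination loop described here can be sketched independently of the HTTP library. This is a minimal illustration under stated assumptions: `iter_all_results` and `post_search` are hypothetical names, `post_search` stands for whatever function sends the payload and returns the decoded JSON, and the response key `"results"` is a guess, since the actual response format is not shown in the question.

```python
from typing import Any, Callable, Dict, Iterator


def iter_all_results(
    post_search: Callable[[Dict[str, Any]], Dict[str, Any]],
    base_payload: Dict[str, Any],
    page_size: int = 210,
    results_key: str = "results",  # assumed key; check the real response
) -> Iterator[Any]:
    """Yield every result by requesting successive pages until one is empty."""
    start = 0
    while True:
        # Same search parameters each time; only the offset changes.
        payload = {**base_payload, "limit": page_size, "start": start}
        page = post_search(payload).get(results_key) or []
        if not page:
            break  # an empty results field marks the end of the data
        yield from page
        # Advance by what was actually returned, in case the server
        # caps the page below the requested limit.
        start += len(page)


if __name__ == "__main__":
    # Fake server for demonstration: 450 records, at most 200 per call.
    records = list(range(450))

    def fake_post(payload: Dict[str, Any]) -> Dict[str, Any]:
        begin = payload["start"]
        count = min(payload["limit"], 200)
        return {"results": records[begin:begin + count]}

    print(len(list(iter_all_results(fake_post, {"countryCode": "za"}))))
```

Advancing `start` by the number of results received, rather than by a fixed 210, is a deliberate choice: the question reports only 200 results for a limit of 210, and a fixed step would skip records if the server caps the page size.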