Scrapy - Request payload format and types

Date: 2019-06-28 20:18:06

Tags: python ajax web-scraping request scrapy

This is the starting point of my scraping process:

https://www.storiaimoveis.com.br/alugar/brasil

This is the AJAX call that returns the data for each page in JSON format:

https://www.storiaimoveis.com.br/api/search?fields=%24%24meta.geo.postalCodeAddress.city%2C%24%24meta.geo.postalCodeAddress.neighborhood%2C%24%24meta.geo.postalCodeAddress.street%2C%24%24meta.location%2C%24%24meta.created%2Caddress.number%2Caddress.postalCode%2Caddress.neighborhood%2Caddress.state%2Cmedia%2ClivingArea%2CtotalArea%2Ctypes%2Coperation%2CsalePrice%2CrentPrice%2CnewDevelopment%2CadministrationFee%2CyearlyTax%2Caccount.logoUrl%2Caccount.name%2Caccount.id%2Caccount.creci%2Cgarage%2Cbedrooms%2Csuites%2Cbathrooms%2Cref&optimizeMedia=true&size=20&from=0&sessionId=5ff29d7e-88d0-54d5-2641-e203cafd6f4e
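The query string above is just URL-encoded (`%24` is `$`, `%2C` is `,`), so it can be decoded with the standard library to see exactly which fields the API is being asked for. A minimal sketch (the URL below keeps only a few of the fields for readability):

```python
from urllib.parse import urlparse, parse_qs

# Trimmed version of the search URL from the question, for illustration.
ajax_url = ("https://www.storiaimoveis.com.br/api/search"
            "?fields=%24%24meta.location%2CrentPrice%2Cref"
            "&optimizeMedia=true&size=20&from=0")

# parse_qs percent-decodes the values for us.
params = parse_qs(urlparse(ajax_url).query)
fields = params["fields"][0].split(",")

print(fields)          # ['$$meta.location', 'rentPrice', 'ref']
print(params["size"])  # ['20']
```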

My POST requests fail with a 404 error. These requests require a payload, and that is what is giving me trouble. So far I have always managed to work around the problem somehow, but now I am trying to understand what I am getting wrong about these payloads.

My questions are:

  • Does the request payload sent with a Scrapy request need to be of a specific type or format?
  • Do I need to call json.dumps(payload) before sending it, or can I send it as a dict?
  • Do I need to convert every key:value pair to a string before sending the payload?
  • Could there be some other reason my requests are failing?

Here is the relevant part of my code.

import json

import scrapy
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):

    name = 'myspider'

    start_urls = [
        'https://www.storiaimoveis.com.br/api/search?fields=%24%24meta.geo.postalCodeAddress.city%2C%24%24meta.geo.postalCodeAddress.neighborhood%2C%24%24meta.geo.postalCodeAddress.street%2C%24%24meta.location%2C%24%24meta.created%2Caddress.number%2Caddress.postalCode%2Caddress.neighborhood%2Caddress.state%2Cmedia%2ClivingArea%2CtotalArea%2Ctypes%2Coperation%2CsalePrice%2CrentPrice%2CnewDevelopment%2CadministrationFee%2CyearlyTax%2Caccount.logoUrl%2Caccount.name%2Caccount.id%2Caccount.creci%2Cgarage%2Cbedrooms%2Csuites%2Cbathrooms%2Cref&optimizeMedia=true&size=20&from=0&sessionId=5ff29d7e-88d0-54d5-2641-e203cafd6f4e'
    ]

    page = 1
    payload = {"locations":[{"geo":{"top_left":{"lat":5.2717863,
                                                "lon":-73.982817},
                                    "bottom_right":{"lat":-34.0891,
                                                    "lon":-28.650543}},
                             "placeId":"ChIJzyjM68dZnAARYz4p8gYVWik",
                             "keywords":"Brasil",
                             "address":{"label":"Brasil","country":"BR"}}],
               "operation":["RENT"],
               "bathrooms":[],
               "bedrooms":[],
               "garage":[],
               "features":[]}
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Referer': 'https://www.storiaimoveis.com.br/alugar/brasil',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }


    def parse(self, response):
        for url in self.start_urls:
            yield scrapy.Request(url=url,
                                 method='POST',
                                 headers=self.headers,
                                 body=json.dumps(self.payload),
                                 callback=self.parse_items)

    def parse_items(self, response):
        from scrapy.shell import inspect_response
        inspect_response(response, self)
        print(response.text)

1 Answer:

Answer 0 (score: 1)

Yes, you need to call json.dumps(payload), because the request body must be a str or unicode object, as stated in the documentation: https://docs.scrapy.org/en/latest/topics/request-response.html#request-objects
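That point is easy to check directly: a dict is not a valid body, but json.dumps turns it into a plain str, and the values keep their JSON types, so there is no need to stringify each key:value pair by hand. A short sketch using a trimmed version of the payload from the question:

```python
import json

# Trimmed payload, for illustration only.
payload = {"operation": ["RENT"], "bathrooms": [], "features": []}

body = json.dumps(payload)

print(type(body))  # <class 'str'> -- acceptable as a Request body

# Round-tripping restores the original dict with its original types,
# so individual values did not need to be converted to strings:
assert json.loads(body) == payload
```

As a side note, newer Scrapy versions (1.8 and later) also ship scrapy.http.JsonRequest, which serializes a dict payload and sets the Content-Type header for you; if your version has it, it saves the manual json.dumps step.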

However, in your case the request was failing because these two headers were missing: Content-Type and Referer.

What I usually do to get the right request headers is:

  1. Inspect the headers in the Chrome developer tools:

[screenshot: the request headers shown in the Chrome developer tools Network tab]

  2. Reproduce the request with curl, adding and removing headers until I find the right set. In this case, Content-Type and Referer seem to be enough to get an HTTP 200 response status:
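That trial-and-error trimming can also be automated. A small sketch, where send_request is a stand-in for whatever client you actually use (curl, requests, a Scrapy shell fetch) and fake_send imitates an endpoint that, like the one above, insists on Content-Type and Referer:

```python
from itertools import combinations

def minimal_headers(headers, send_request):
    """Try ever-smaller header subsets, smallest first; return the first
    subset for which send_request(headers_dict) reports HTTP 200."""
    names = list(headers)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            candidate = {k: headers[k] for k in subset}
            if send_request(candidate) == 200:
                return candidate
    return None

# Hypothetical endpoint behaviour, for illustration only.
def fake_send(hdrs):
    return 200 if {"Content-Type", "Referer"} <= set(hdrs) else 404

full = {"Accept": "application/json",
        "Content-Type": "application/json",
        "Referer": "https://www.storiaimoveis.com.br/alugar/brasil",
        "User-Agent": "Mozilla/5.0"}

required = minimal_headers(full, fake_send)
print(sorted(required))  # ['Content-Type', 'Referer']
```

Note that this grows combinatorially with the number of headers, so it is only practical for the handful of headers a browser typically sends.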

[screenshots: curl requests showing the trimmed header set returning HTTP 200]