Scrapy JSON response.body differs from the JSON response on the website

Date: 2015-05-22 11:59:24

Tags: python json request scrapy response

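I am fairly new to Scrapy, but I have written a few spiders.

I am trying to scrape reviews from this website: http://www.reviewed.com/search/products?sort=rating,desc. Using Firebug, I can see that the page makes a POST request that returns a JSON response containing the products. So far, my spider looks like this: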

import scrapy
import json
from scrapy.conf import settings
from scrapy.http import Request
from scrapy.selector import Selector
from reviews.items import ReviewItem
from reviews import utils

class ReviewedbaseSpider(scrapy.Spider):
    name = "reviewedbase" #spider name to call in terminal
    allowed_domains = ['reviewed.com'] #the domain where the spider is allowed to crawl
    start_urls = ['http://www.reviewed.com/search/products?sort=rating,desc'] #url from which the spider will start crawling

    url_post = 'http://cerebro-production.herokuapp.com/products_production/_search'

    def parse(self, response):
        bodydata = '{"query":{"filtered":{"query":{"function_score":{"query":{"match_all":{}},"functions":[{"filter":{"exists":{"field":"review_publish_on"}},"gauss":{"review_publish_on":{"origin":"now","scale":"700d","decay":0.9}}},{"filter":{"term":{"comparable_name":"car"}},"boost_factor":"1.2"},{"linear":{"rating":{"origin":10,"scale":10,"decay":0.9}}},{"filter":{"term":{"archived":false}},"boost_factor":"1.2"},{"filter":{"term":{"has_rating":true}},"boost_factor":"1.2"},{"filter":{"exists":{"field":"review_publish_on"}},"boost_factor":"1.2"}],"score_mode":"multiply"}},"filter":{"and":[{"match_all":{}},{"not":{"terms":{"website_ids":["cruises","cookware","cutlery","kitchenequipment","lenses"]}}}]}}},"filter":{},"sort":[{"archived":"asc"},{"rating":"desc"}],"size":20,"from":0,"aggs":{"has_rating":{"aggs":{"has_rating":{"terms":{"field":"has_rating","size":0}}},"filter":{"and":[{"match_all":{}}]}},"has_awards":{"aggs":{"has_awards":{"terms":{"field":"has_awards","size":0}}},"filter":{"and":[{"match_all":{}}]}}}}'
        req = Request(url=self.url_post,
                    method="POST",
                    headers={"Accept": "application/json, text/plain, */*", "Accept-Encoding": "gzip, deflate", "Accept-Language": "pt-PT,pt;q=0.8,en;q=0.5,en-US;q=0.3", "Cache-Control": " no-cache", "Connection": "keep-alive", "Content-Length":  "1009", "Content-Type": "application/json;charset=utf-8", "Pragma": "no-cache", "Referer": "http://www.reviewed.com/search/products?sort=rating,desc", "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"},
                    dont_filter=True,
                    body=json.dumps(bodydata),
                    callback=self.parse_json)
        print "------REQUEST-----------" + str(req)
        yield req

    def parse_json(self, response):
        print "---------------RESPONSE.BODY---------------------" + str(response.body)

        json_data = json.loads(response.body)

        #print "-----------------JSON_DATA----------------------" + str(json_data)

        #print "--------HITS----------" + str(json_data["hits"])


    def parse_review(self, response):
        pass

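What confuses me is the response. It is not the same as the response from the website, which I can see in Firebug's "Response" tab. In the Scrapy spider there are only 10 hits, whereas the website's response has 20 hits, and the page (start_url) does show 20 products. What am I doing wrong? Am I sending the wrong request?

I have tested the headers, with bodydata as the request body, in RESTClient and it works fine there: it also returns 20 hits.

With this spider I get an error: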

2015-05-22 16:34:58+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)
2015-05-22 16:34:58+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-22 16:34:58+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 1, 'BOT_NAME': 'reviews'}
2015-05-22 16:34:58+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-22 16:34:58+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RotateUserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-22 16:34:58+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-22 16:34:58+0100 [scrapy] INFO: Enabled item pipelines: 
2015-05-22 16:34:58+0100 [reviewedbase] INFO: Spider opened
2015-05-22 16:34:58+0100 [reviewedbase] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-22 16:34:58+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6058
2015-05-22 16:34:58+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6115
2015-05-22 16:34:59+0100 [reviewedbase] DEBUG: Crawled (200) <GET http://www.reviewed.com/search/products?sort=rating,desc> (referer: None)
------REQUEST-----------<POST http://cerebro-production.herokuapp.com/products_production/_search>
2015-05-22 16:34:59+0100 [reviewedbase] DEBUG: Retrying <POST http://cerebro-production.herokuapp.com/products_production/_search> (failed 1 times): 400 Bad Request
2015-05-22 16:35:00+0100 [reviewedbase] DEBUG: Retrying <POST http://cerebro-production.herokuapp.com/products_production/_search> (failed 2 times): 400 Bad Request
2015-05-22 16:35:02+0100 [reviewedbase] DEBUG: Gave up retrying <POST http://cerebro-production.herokuapp.com/products_production/_search> (failed 3 times): 400 Bad Request
2015-05-22 16:35:02+0100 [reviewedbase] DEBUG: Crawled (400) <POST http://cerebro-production.herokuapp.com/products_production/_search> (referer: http://www.reviewed.com/search/products?sort=rating,desc)
2015-05-22 16:35:02+0100 [reviewedbase] DEBUG: Ignoring response <400 http://cerebro-production.herokuapp.com/products_production/_search>: HTTP status code is not handled or not allowed
2015-05-22 16:35:02+0100 [reviewedbase] INFO: Closing spider (finished)
2015-05-22 16:35:02+0100 [reviewedbase] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 5384,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 1,
     'downloader/request_method_count/POST': 3,
     'downloader/response_bytes': 52191,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/400': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 5, 22, 15, 35, 2, 371278),
     'log_count/DEBUG': 8,
     'log_count/INFO': 7,
     'request_depth_max': 1,
     'response_received_count': 2,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2015, 5, 22, 15, 34, 58, 880902)}
2015-05-22 16:35:02+0100 [reviewedbase] INFO: Spider closed (finished)

400 means Bad Request, right? :/ What can I do or change?

Thank you very much for your time and help.
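
One thing that may be worth checking here (an assumption, not a confirmed cause): in the spider, bodydata is already a JSON string, yet it is passed through json.dumps before being sent. Applying json.dumps to a string produces a quoted JSON string literal rather than the original object, so the server would decode a plain string instead of a query object. The sketch below only illustrates that effect with a shortened, hypothetical stand-in for bodydata; it does not contact the site.

```python
import json

# Shortened, hypothetical stand-in for the spider's bodydata (illustration only).
bodydata = '{"query": {"match_all": {}}, "size": 20, "from": 0}'

# json.dumps on a str wraps it in quotes and escapes the inner quotes.
double_encoded = json.dumps(bodydata)

# What a server would get after decoding each candidate body once.
decoded_from_double = json.loads(double_encoded)  # still a str
decoded_from_raw = json.loads(bodydata)           # a dict with "query", "size", "from"

print(type(decoded_from_double).__name__)
print(type(decoded_from_raw).__name__)

# The encoded body is also longer than the original string, so a hard-coded
# Content-Length header would no longer match it.
print(len(bodydata), len(double_encoded))
```

If this is what is happening, passing body=bodydata directly and leaving out the hard-coded "Content-Length" header might be worth trying, since the header value was copied from the browser's request and may not match the body the spider actually sends.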

0 answers:

No answers yet