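I'm fairly new to Scrapy, but I have already written a few spiders.
I'm trying to scrape the reviews from this site: http://www.reviewed.com/search/products?sort=rating,desc. Using Firebug, I can see that the page makes a POST request, which returns a JSON response containing the products. So far, my spider looks like this: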
import scrapy
import json
from scrapy.conf import settings
from scrapy.http import Request
from scrapy.selector import Selector
from reviews.items import ReviewItem
from reviews import utils

class ReviewedbaseSpider(scrapy.Spider):
    name = "reviewedbase" #spider name to call in terminal
    allowed_domains = ['reviewed.com'] #the domain where the spider is allowed to crawl
    start_urls = ['http://www.reviewed.com/search/products?sort=rating,desc'] #url from which the spider will start crawling
    url_post = 'http://cerebro-production.herokuapp.com/products_production/_search'

    def parse(self, response):
        bodydata = '{"query":{"filtered":{"query":{"function_score":{"query":{"match_all":{}},"functions":[{"filter":{"exists":{"field":"review_publish_on"}},"gauss":{"review_publish_on":{"origin":"now","scale":"700d","decay":0.9}}},{"filter":{"term":{"comparable_name":"car"}},"boost_factor":"1.2"},{"linear":{"rating":{"origin":10,"scale":10,"decay":0.9}}},{"filter":{"term":{"archived":false}},"boost_factor":"1.2"},{"filter":{"term":{"has_rating":true}},"boost_factor":"1.2"},{"filter":{"exists":{"field":"review_publish_on"}},"boost_factor":"1.2"}],"score_mode":"multiply"}},"filter":{"and":[{"match_all":{}},{"not":{"terms":{"website_ids":["cruises","cookware","cutlery","kitchenequipment","lenses"]}}}]}}},"filter":{},"sort":[{"archived":"asc"},{"rating":"desc"}],"size":20,"from":0,"aggs":{"has_rating":{"aggs":{"has_rating":{"terms":{"field":"has_rating","size":0}}},"filter":{"and":[{"match_all":{}}]}},"has_awards":{"aggs":{"has_awards":{"terms":{"field":"has_awards","size":0}}},"filter":{"and":[{"match_all":{}}]}}}}'
        req = Request(url=self.url_post,
                      method="POST",
                      headers={"Accept": "application/json, text/plain, */*",
                               "Accept-Encoding": "gzip, deflate",
                               "Accept-Language": "pt-PT,pt;q=0.8,en;q=0.5,en-US;q=0.3",
                               "Cache-Control": " no-cache",
                               "Connection": "keep-alive",
                               "Content-Length": "1009",
                               "Content-Type": "application/json;charset=utf-8",
                               "Pragma": "no-cache",
                               "Referer": "http://www.reviewed.com/search/products?sort=rating,desc",
                               "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"},
                      dont_filter=True,
                      body=json.dumps(bodydata),
                      callback=self.parse_json)
        print "------REQUEST-----------" + str(req)
        yield req

    def parse_json(self, response):
        print "---------------RESPONSE.BODY---------------------" + str(response.body)
        json_data = json.loads(response.body)
        #print "-----------------JSON_DATA----------------------" + str(json_data)
        #print "--------HITS----------" + str(json_data["hits"])

    def parse_review(self, response):
        pass
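What confuses me is the response I get. It is not the same as the response from the site that I see in Firebug's "Response" tab. In the Scrapy spider there are only 10 hits, whereas the site's response has 20 hits, and the page itself (start_url) does show 20 products. What am I doing wrong? Am I sending the wrong request?
I have tested the same headers, with bodydata as the request body, in RESTClient and it works fine there: it also returns 20 hits.
With this spider I get an error: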
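One way to narrow down the 10-versus-20 difference is to compare the `size` requested in the body with the number of hits that actually come back. The sketch below assumes the endpoint returns the usual Elasticsearch search response shape (a top-level "hits" object that contains a "hits" list); `count_hits` and the sample strings are hypothetical, made up only for illustration. If 20 were requested and 10 come back, it may be that the server did not apply the body as sent and fell back to Elasticsearch's default page size of 10.

```python
import json


def count_hits(request_body, response_body):
    """Return (size requested in the body, number of hits returned)."""
    requested = json.loads(request_body).get("size")
    returned = len(json.loads(response_body)["hits"]["hits"])
    return requested, returned


# Hypothetical sample data, not taken from the real site.
sample_request = '{"query": {"match_all": {}}, "size": 20, "from": 0}'
sample_response = json.dumps(
    {"hits": {"total": 57, "hits": [{"_id": str(i)} for i in range(10)]}}
)

requested, returned = count_hits(sample_request, sample_response)
print("requested=%s returned=%s" % (requested, returned))
```

In the spider, the same comparison could be made inside parse_json with bodydata and response.body.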
2015-05-22 16:34:58+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)
2015-05-22 16:34:58+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-22 16:34:58+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 1, 'BOT_NAME': 'reviews'}
2015-05-22 16:34:58+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-22 16:34:58+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RotateUserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-22 16:34:58+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-22 16:34:58+0100 [scrapy] INFO: Enabled item pipelines:
2015-05-22 16:34:58+0100 [reviewedbase] INFO: Spider opened
2015-05-22 16:34:58+0100 [reviewedbase] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-22 16:34:58+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6058
2015-05-22 16:34:58+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6115
2015-05-22 16:34:59+0100 [reviewedbase] DEBUG: Crawled (200) <GET http://www.reviewed.com/search/products?sort=rating,desc> (referer: None)
------REQUEST-----------<POST http://cerebro-production.herokuapp.com/products_production/_search>
2015-05-22 16:34:59+0100 [reviewedbase] DEBUG: Retrying <POST http://cerebro-production.herokuapp.com/products_production/_search> (failed 1 times): 400 Bad Request
2015-05-22 16:35:00+0100 [reviewedbase] DEBUG: Retrying <POST http://cerebro-production.herokuapp.com/products_production/_search> (failed 2 times): 400 Bad Request
2015-05-22 16:35:02+0100 [reviewedbase] DEBUG: Gave up retrying <POST http://cerebro-production.herokuapp.com/products_production/_search> (failed 3 times): 400 Bad Request
2015-05-22 16:35:02+0100 [reviewedbase] DEBUG: Crawled (400) <POST http://cerebro-production.herokuapp.com/products_production/_search> (referer: http://www.reviewed.com/search/products?sort=rating,desc)
2015-05-22 16:35:02+0100 [reviewedbase] DEBUG: Ignoring response <400 http://cerebro-production.herokuapp.com/products_production/_search>: HTTP status code is not handled or not allowed
2015-05-22 16:35:02+0100 [reviewedbase] INFO: Closing spider (finished)
2015-05-22 16:35:02+0100 [reviewedbase] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 5384,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 1,
'downloader/request_method_count/POST': 3,
'downloader/response_bytes': 52191,
'downloader/response_count': 4,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/400': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 22, 15, 35, 2, 371278),
'log_count/DEBUG': 8,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2015, 5, 22, 15, 34, 58, 880902)}
2015-05-22 16:35:02+0100 [reviewedbase] INFO: Spider closed (finished)
A 400 means a bad request, right? :/ What can I do or change?
Thank you very much for your time and help.
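One possible cause of the 400, offered as a sketch rather than a confirmed diagnosis: bodydata is already JSON text, so passing it through json.dumps encodes it a second time. The server would then receive a quoted JSON string instead of a JSON object. The example below uses a shortened, hypothetical body to show the difference using only the standard library.

```python
import json

# In the spider, bodydata is already JSON text (shortened here for illustration).
bodydata = '{"size": 20, "from": 0}'

# Encoding it again wraps the whole text in quotes and escapes the inner
# quotes, so it decodes back to a string rather than an object.
double_encoded = json.dumps(bodydata)

# Sending the text unchanged keeps it a JSON object.
sent_as_is = bodydata

print(double_encoded)
print("decoded type when double-encoded: %s" % type(json.loads(double_encoded)).__name__)
print("decoded type when sent as-is: %s" % type(json.loads(sent_as_is)).__name__)
print("lengths: %d vs %d" % (len(sent_as_is), len(double_encoded)))
```

If this is indeed the cause, passing body=bodydata to Request might be worth trying. The double-encoded body is also longer than the original text, so the hard-coded "Content-Length": "1009" header may no longer match what is actually sent; removing that header could be worth testing as well.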