Crawled (200) but not scraped - Crawlera

Time: 2017-02-13 03:39:57

Tags: python mongodb scrapy web-crawler

Hi, I'm working on the C10 project again and trying to crawl the Amazon website.

I've run into a problem: sometimes the log says a page was crawled, but the spider doesn't scrape the data I want and instead just moves on to the next page as instructed. From some pages it scrapes a few items, from others nothing, and I don't understand why. I've checked the pages' HTML and the URLs, and the items I want are present on the site, yet the log says the page was crawled but nothing was scraped. Can anyone help me understand what's going on? I thought the site might be returning a captcha, but even so, I assumed Crawlera would automatically retry any request that got a captcha.

Here is the log:

'time': '2017-02-12',
'title': u'Basic GIS Coordinates, Second Edition',
'url': u'https://www.amazon.com/Basic-GIS-Coordinates-Second-Sickle/dp/1420092316/ref=sr_1_64?s=tradein-aps&srs=9187220011&ie=UTF8&qid=1486932384&sr=1-64'}
2017-02-12 14:46:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/s//s/ref=sr_nr_n_3/153-6246827-9833634?srs=9187220011&fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&bbn=227541&ie=UTF8&qid=1486860051&rnid=227541> (referer: None)
2017-02-12 14:46:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/s//s/ref=sr_nr_n_2/153-6246827-9833634?srs=9187220011&fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A52187011&bbn=227541&ie=UTF8&qid=1486860051&rnid=227541> (referer: None)
2017-02-12 14:46:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/s/ref=sr_pg_2/153-6246827-9833634?bbn=227541&fst=as%3Aoff&ie=UTF8&page=2&qid=1486932385&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&srs=9187220011> (referer: https://www.amazon.com/s//s/ref=sr_nr_n_3/153-6246827-9833634?srs=9187220011&fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&bbn=227541&ie=UTF8&qid=1486860051&rnid=227541)
2017-02-12 14:46:44 [scrapy.log] DEBUG: successfully added!
2017-02-12 14:46:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/s/ref=sr_pg_2/153-6246827-9833634?bbn=227541&fst=as%3Aoff&ie=UTF8&page=2&qid=1486932385&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&srs=9187220011>
{'currency': u'$',

1 Answer:

Answer 0 (score: 0)

Since you're crawling Amazon, my guess is that you're getting a "captcha" page instead of the regular product page.
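One way to confirm this is a cheap heuristic check on the response body before parsing. A minimal sketch, assuming the captcha interstitial contains marker strings like "Robot Check" (commonly reported for Amazon, but verify against the pages your spider actually receives):

```python
# Heuristic check for a captcha interstitial page.
# The marker strings below are assumptions based on commonly reported
# Amazon "Robot Check" pages; adjust them to match what you really get.

CAPTCHA_MARKERS = (
    "Robot Check",
    "Type the characters you see in this image",
)

def looks_like_captcha(html: str) -> bool:
    """Return True if the page body resembles a captcha interstitial."""
    return any(marker in html for marker in CAPTCHA_MARKERS)

# Inside a Scrapy callback you could then retry such responses yourself
# instead of relying on Crawlera, e.g.:
#     if looks_like_captcha(response.text):
#         yield response.request.replace(dont_filter=True)
#         return
```

With this in place, pages that are "Crawled (200)" but yield no items would at least be logged or re-queued rather than silently skipped.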

Perhaps you should print the contents of the response instead of just returning items; then you'll be able to tell which pages were actually scraped in full.
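Rather than printing to the console, it can be easier to dump each fetched body to disk and inspect the suspicious ones by hand. A small stdlib-only sketch (the `debug_pages` directory name and the helper itself are illustrative, not part of Scrapy):

```python
# Save raw response bodies so "Crawled (200)" pages can be inspected
# by hand. File names are derived from an MD5 hash of the URL; the
# output directory name is an arbitrary choice for this example.

import hashlib
import os

def dump_page(body: bytes, url: str, out_dir: str = "debug_pages") -> str:
    """Write the raw body to <out_dir>/<md5-of-url>.html and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    name = hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(out_dir, name)
    with open(path, "wb") as f:
        f.write(body)
    return path

# Inside a Scrapy callback:
#     self.logger.debug("saved %s", dump_page(response.body, response.url))
```

Opening the saved files for the URLs that produced no items should show immediately whether you got a captcha page, an empty search result, or markup that your selectors simply don't match.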