I am trying to scrape data from this Amazon page, but I keep getting sent back to the homepage. I have tried the advice from other posts about changing my USER_AGENT, but that did not help in this case. Also, since cookies are enabled by default in Scrapy, I even tried reaching the page I want through another page first, but that did not work either.
Here is my code:
def parse(self, response):
    url_to_go = "http://amazon.com" + response.xpath('//*[@id="refinements"]/div[2]/ul[1]/li[1]/ul/li[1]/a/@href').extract()[0]
    cook = {'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Cookie': 'session-token="lslMFQ/aZv4uOPOndfqyl4uQo+2j28Ziy3aMBwCCUsVPeFX9xoCsUv6jvR2U+YAnSxlBVTl4PtTpCeaIA13g2/XC1DqNd95tDulSOPeEbxETVBgwS4i/vTIQmUOybv+I5wYP12XCIGOh7QrpGLE+/gGTgAjM+1KaA9Ua6D2lEZoPPyONk8K4MiWAOxbjOVgaV/i5lbEbp1Kfn4PbXl555g=="; x-wl-uid=1ZDt5hegLdX+sR4SzNbD6q5TZD/tTVmo+y68B5HuediDPf5/oClQ5IbnNGcF0D+ollnxQ1vp63iw=; csm-hit=s-0AMVHWRT97Q8WVYKN4TA|1464956175963; session-id-time=2082787201l; session-id=186-8139005-7450816; ubid-main=181-6639382-1676153',
            'Host': 'www.amazon.com',
            'Origin': 'http://www.amazon.com',
            'Referer': 'http://www.amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln/184-2672682-0617523?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464897830&scn=13896615011&h=d5e0fad5fcbb448fb4f65192f8ecdc6f0425487e'}
    request = Request(url_to_go, headers=cook, meta={'asinlist': []}, callback=self.scrape)
    return request
The request I return is for the page I actually want to scrape. Does anyone know how I can get anything scraped from this page?
Response as requested:
URLError: <urlopen error timed out>
2016-06-03 08:50:09 [boto] ERROR: Unable to read instance data, giving up
2016-06-03 08:50:09 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-06-03 08:50:09 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-06-03 08:50:09 [scrapy] INFO: Enabled item pipelines: LinkcrawlerPipeline
2016-06-03 08:50:09 [scrapy] INFO: Spider opened
2016-06-03 08:50:09 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-03 08:50:09 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-03 08:50:10 [scrapy] DEBUG: Crawled (200) <GET http://www.amazon.com/s/ref=sr_nr_n_0?srs=2529580011> (referer: None)
2016-06-03 08:50:11 [scrapy] DEBUG: Crawled (200) <GET http://amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln/182-9928517-8357940?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464958209&scn=13896615011&h=0821a5f910d5cd8152cac30d0456fe79538da1fb> (referer: http://www.amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln/184-2672682-0617523?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464897830&scn=13896615011&h=d5e0fad5fcbb448fb4f65192f8ecdc6f0425487e)
2016-06-03 08:50:11 [scrapy] ERROR: Spider error processing <GET http://amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln/182-9928517-8357940?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464958209&scn=13896615011&h=0821a5f910d5cd8152cac30d0456fe79538da1fb> (referer: http://www.amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln/184-2672682-0617523?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464897830&scn=13896615011&h=d5e0fad5fcbb448fb4f65192f8ecdc6f0425487e)
Traceback (most recent call last):
  File "C:\Users\usr\AppData\Local\Continuum\Anaconda2\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\usr\Desktop\linkcrawler\linkcrawler\spiders\crawl.py", line 65, in scrape
    item_list = response.xpath('//*[contains(@id=,"result_"]').extract()
  File "C:\Users\usr\AppData\Local\Continuum\Anaconda2\lib\site-packages\scrapy\http\response\text.py", line 109, in xpath
    return self.selector.xpath(query)
  File "C:\Users\usr\AppData\Local\Continuum\Anaconda2\lib\site-packages\scrapy\selector\unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
ValueError: Invalid XPath: //*[contains(@id=,"result_"]
2016-06-03 08:50:11 [scrapy] INFO: Closing spider (finished)
2016-06-03 08:50:11 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1199,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 147875,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 3, 12, 50, 11, 340000),
'log_count/DEBUG': 4,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2016, 6, 3, 12, 50, 9, 310000)}
2016-06-03 08:50:11 [scrapy] INFO: Spider closed (finished)
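Side note on the `ValueError` in the log above: the XPath `//*[contains(@id=,"result_"]` has a stray `=` after `@id` and a missing closing parenthesis; the well-formed version would be `//*[contains(@id,"result_")]`. A tiny stdlib check of delimiter balance makes the difference visible (a sketch, unrelated to the redirect problem itself):

```python
def balanced(expr):
    """Return True if the ( ) and [ ] in an XPath string are properly nested."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in expr:
        if ch in "([":
            stack.append(ch)
        elif ch in ")]":
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

print(balanced('//*[contains(@id=,"result_"]'))   # the query from the traceback
print(balanced('//*[contains(@id,"result_")]'))   # the corrected query
```

This only catches the bracket mismatch, of course; the stray `=` additionally makes the expression invalid even with balanced brackets.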
If you try to reach the page I want to scrape in the shell with the command

scrapy shell -s USER_AGENT="whatever" http://www.amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464956176&scn=13896615011&h=4f68f64151566295c00f8b63baa57f9aa1cdeb07

and then call view(response), you will see that the page Scrapy gets is not the page at that URL.
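One way I check for this redirect-to-homepage behaviour programmatically is to compare the path of the URL that was requested with the path of the URL that came back (in a spider these would be `response.request.url` and `response.url`; the helper below is a minimal stdlib sketch with hypothetical names):

```python
from urllib.parse import urlparse

def redirected_to_homepage(requested, received):
    """True if we asked for a deep path but the response URL is the site root."""
    req_path = urlparse(requested).path
    rec_path = urlparse(received).path
    return req_path not in ("", "/") and rec_path in ("", "/")

print(redirected_to_homepage(
    "http://www.amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln",
    "http://www.amazon.com/"))   # sent back to the homepage
```

This does not fix the redirect, but it lets the spider log or retry cleanly instead of parsing the wrong page.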