Scrapy gets redirected

Date: 2016-06-03 12:41:26

Tags: python web scrapy web-crawler

I'm trying to get data from this amazon page, but I keep getting redirected back to Amazon's homepage. I've tried the advice from other posts about changing my USER_AGENT, but it didn't help in this case. Also, since cookies are enabled by default in Scrapy, I even tried reaching the page I want by going through another page first, but that didn't work either.
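For reference, the kind of user-agent override those posts suggest looks roughly like this (the exact UA string below is just an example of the browser strings I tried):

# settings.py -- replace Scrapy's default user agent with a browser-like one.
# The string below is illustrative, not the only one I tried.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')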

Here is my code:

from scrapy.http import Request

def parse(self, response):
    # Follow the first category refinement link to the page I actually want.
    url_to_go = "http://amazon.com" + response.xpath(
        '//*[@id="refinements"]/div[2]/ul[1]/li[1]/ul/li[1]/a/@href'
    ).extract()[0]

    # Browser-like request headers, including a raw Cookie string copied
    # from a real browser session.
    cook = {
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'session-token="lslMFQ/aZv4uOPOndfqyl4uQo+2j28Ziy3aMBwCCUsVPeFX9xoCsUv6jvR2U+YAnSxlBVTl4PtTpCeaIA13g2/XC1DqNd95tDulSOPeEbxETVBgwS4i/vTIQmUOybv+I5wYP12XCIGOh7QrpGLE+/gGTgAjM+1KaA9Ua6D2lEZoPPyONk8K4MiWAOxbjOVgaV/i5lbEbp1Kfn4PbXl555g=="; x-wl-uid=1ZDt5hegLdX+sR4SzNbD6q5TZD/tTVmo+y68B5HuediDPf5/oClQ5IbnNGcF0D+ollnxQ1vp63iw=; csm-hit=s-0AMVHWRT97Q8WVYKN4TA|1464956175963; session-id-time=2082787201l; session-id=186-8139005-7450816; ubid-main=181-6639382-1676153',
        'Host': 'www.amazon.com',
        'Origin': 'http://www.amazon.com',
        'Referer': 'http://www.amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln/184-2672682-0617523?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464897830&scn=13896615011&h=d5e0fad5fcbb448fb4f65192f8ecdc6f0425487e',
    }

    # Request the target page with the browser headers attached.
    request = Request(url_to_go, headers=cook, meta={'asinlist': []}, callback=self.scrape)
    return request

The request I return is for the page I actually want to scrape. Does anyone know how I can get anything out of this page?
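For what it's worth, since cookies are enabled, Scrapy's CookiesMiddleware normally manages the Cookie header itself, and the docs recommend passing cookies through the Request's cookies argument rather than a hand-built header. A minimal sketch of that variant (session values copied from the dict above; cook_without_cookie is a hypothetical name for the headers dict minus its 'Cookie' key):

# Sketch: let CookiesMiddleware track the session cookies instead of
# sending a raw 'Cookie' header.
request = Request(
    url_to_go,
    headers=cook_without_cookie,  # hypothetical: `cook` minus the 'Cookie' key
    cookies={
        'session-id': '186-8139005-7450816',
        'session-id-time': '2082787201l',
        'ubid-main': '181-6639382-1676153',
    },
    meta={'asinlist': []},
    callback=self.scrape,
)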

Log output, as requested:
URLError: <urlopen error timed out>
2016-06-03 08:50:09 [boto] ERROR: Unable to read instance data, giving up
2016-06-03 08:50:09 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-06-03 08:50:09 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-06-03 08:50:09 [scrapy] INFO: Enabled item pipelines: LinkcrawlerPipeline
2016-06-03 08:50:09 [scrapy] INFO: Spider opened
2016-06-03 08:50:09 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-03 08:50:09 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-03 08:50:10 [scrapy] DEBUG: Crawled (200) <GET http://www.amazon.com/s/ref=sr_nr_n_0?srs=2529580011> (referer: None)
2016-06-03 08:50:11 [scrapy] DEBUG: Crawled (200) <GET http://amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln/182-9928517-8357940?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464958209&scn=13896615011&h=0821a5f910d5cd8152cac30d0456fe79538da1fb> (referer: http://www.amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln/184-2672682-0617523?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464897830&scn=13896615011&h=d5e0fad5fcbb448fb4f65192f8ecdc6f0425487e)
2016-06-03 08:50:11 [scrapy] ERROR: Spider error processing <GET http://amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln/182-9928517-8357940?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464958209&scn=13896615011&h=0821a5f910d5cd8152cac30d0456fe79538da1fb> (referer: http://www.amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln/184-2672682-0617523?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464897830&scn=13896615011&h=d5e0fad5fcbb448fb4f65192f8ecdc6f0425487e)
Traceback (most recent call last):
  File "C:\Users\usr\AppData\Local\Continuum\Anaconda2\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\usr\Desktop\linkcrawler\linkcrawler\spiders\crawl.py", line 65, in scrape
    item_list = response.xpath('//*[contains(@id=,"result_"]').extract()
  File "C:\Users\usr\AppData\Local\Continuum\Anaconda2\lib\site-packages\scrapy\http\response\text.py", line 109, in xpath
    return self.selector.xpath(query)
  File "C:\Users\usr\AppData\Local\Continuum\Anaconda2\lib\site-packages\scrapy\selector\unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
ValueError: Invalid XPath: //*[contains(@id=,"result_"]
2016-06-03 08:50:11 [scrapy] INFO: Closing spider (finished)
2016-06-03 08:50:11 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1199,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 147875,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 6, 3, 12, 50, 11, 340000),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 3,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2016, 6, 3, 12, 50, 9, 310000)}
2016-06-03 08:50:11 [scrapy] INFO: Spider closed (finished)
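Note that the ValueError at the end of the traceback is a separate problem from the redirect: the XPath in my scrape() callback is malformed (a stray = and an unclosed predicate bracket; contains() takes two comma-separated arguments). The corrected line would presumably be:

# The original '//*[contains(@id=,"result_"]' is not valid XPath;
# this selects every element whose id contains "result_".
item_list = response.xpath('//*[contains(@id, "result_")]').extract()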

If you try to reach the page I want to scrape from the shell with the command

scrapy shell -s USER_AGENT="whatever" http://www.amazon.com/s/ref=sr_nr_scat_13896615011_2529580011_ln?srs=2529580011&rh=n%3A13896615011&ie=UTF8&qid=1464956176&scn=13896615011&h=4f68f64151566295c00f8b63baa57f9aa1cdeb07

and then run view(response), you will see that the page Scrapy receives is not the page at that URL.
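A quick way to see the mismatch from inside the shell (view() is the shell's built-in helper that opens the downloaded body in a browser):

>>> response.url    # the URL Scrapy actually ended up on
>>> view(response)  # opens the downloaded HTML in a local browser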

0 Answers