Here is my code:
import re

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector

# AmazonScraper is the project's Item class; its import is not shown in the original post.


class MySpider(CrawlSpider):
    name = "scraper"
    allowed_domains = ["amazon.com"]
    start_urls = ["http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011"]
    rules = [Rule(SgmlLinkExtractor(allow=('.*?/\gp/\product.*?')), callback='parse_items', follow=True)]

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        url = response.url
        item = AmazonScraper()
        print 'inside'
        print sel.css('#btAsinTitle::text').extract()
        item["title"] = ''.join(sel.css('#btAsinTitle::text').extract())
        print '-----', item["title"]
        print response.url
        # Join the extracted text nodes, then strip all whitespace from each price.
        item["digitalprice"] = ''.join(sel.css('.digitalListPrice>.listprice::text').extract())
        item["digitalprice"] = re.sub(r'\s+', '', item["digitalprice"])
        item["listprice"] = ''.join(sel.css('.listPrice::text').extract())
        item["listprice"] = re.sub(r'\s+', '', item["listprice"])
        item["kindleprice"] = ''.join(sel.css('.priceLarge::text').extract())
        item["kindleprice"] = re.sub(r'\s+', '', item["kindleprice"])
        # Note: ''.join(...) always returns a string, so this check is always true.
        if item["digitalprice"] is not None and item["listprice"] is not None and item["kindleprice"] is not None:
            items.append(item)
        print items
        return items
I am also getting URLs that do not match the regex. Why is that? I want to crawl all the book links on the seed page.
Answer 0 (score: 0)
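As I suggested in the comments, it may be worth taking another look at your regex.
Here is a rather long scrapy shell session (long because of the number of links; I skipped some of them). I ran it from France, so the response may differ in your part of the world. It seems to pick up quite a lot of product links: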
paul@paul-SATELLITE-R830:~$ scrapy shell "http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011" --set USER_AGENT="Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
2014-06-20 12:58:05+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
...
2014-06-20 12:58:06+0200 [default] INFO: Spider opened
2014-06-20 12:58:08+0200 [default] DEBUG: Crawled (200) <GET http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f6ec6fb4310>
[s] item {}
[s] request <GET http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011>
[s] response <200 http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011>
[s] sel <Selector xpath=None data=u'<html>\n <head>\n <meta http-equ'>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x7f6ec6740590>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [2]: lx = SgmlLinkExtractor(allow=('.*?/\gp/\product.*?',))
In [3]: import pprint
In [4]: pprint.pprint([link.url for link in lx.extract_links(response)])
['http://www.amazon.com/gp/product/B00DBYBNEE/ref=gno_joinprmlogo/181-5939241-1829655',
'http://www.amazon.com/gp/product/B00DBYBNEE/ref=nav_prime_join/181-5939241-1829655',
'http://www.amazon.com/gp/product/B007HCCNJU/ref=topnav_storetab_kstore/181-5939241-1829655',
'http://www.amazon.com/gp/product/B00FL3YL7O/ref=amb_link_410918762_2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1775973302&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-top-1&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00GL3MGTI/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101',
'http://www.amazon.com/gp/product-reviews/B00GL3MGTI/ref=s9_al_bw_rs1/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1',
'http://www.amazon.com/gp/product/B00HWI5OP4/ref=s9_al_bw_g351_i2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101',
'http://www.amazon.com/gp/product-reviews/B00HWI5OP4/ref=s9_al_bw_rs2/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1',
'http://www.amazon.com/gp/product/B009NF6Z2K/ref=s9_al_bw_g351_i3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101',
'http://www.amazon.com/gp/product-reviews/B009NF6Z2K/ref=s9_al_bw_rs3/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1',
...
'http://www.amazon.com/gp/product-reviews/B00DN7BAUG/ref=s9_hps_bw_rs3/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101&showViewpoints=1',
'http://www.amazon.com/gp/product/B00A7H2CFW/ref=s9_hps_bw_g351_i4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101',
'http://www.amazon.com/gp/product-reviews/B00A7H2CFW/ref=s9_hps_bw_rs4/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101&showViewpoints=1',
'http://www.amazon.com/gp/product/B00B52IQNA/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00B52IQNA/ref=s9_al_bw_g351_t1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00B52IQT4/ref=s9_al_bw_g351_i2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00B52IQT4/ref=s9_al_bw_g351_t2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00B52IQSA/ref=s9_al_bw_g351_i3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00B52IQSA/ref=s9_al_bw_g351_t3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00FL3YL6K/ref=amb_link_410240162_1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00FL3YL6K/ref=amb_link_410240162_3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00DZQE2Y6/ref=amb_link_410240162_4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00C7XTOMS/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711175222&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-8&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00C7XTOMS/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711175222&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-8&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101']
In [5]: lx = SgmlLinkExtractor(allow=('/gp/product/',))
In [6]: pprint.pprint([link.url for link in lx.extract_links(response)])
['http://www.amazon.com/gp/product/B00DBYBNEE/ref=gno_joinprmlogo/181-5939241-1829655',
'http://www.amazon.com/gp/product/B00DBYBNEE/ref=nav_prime_join/181-5939241-1829655',
...
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101']
In [7]: len([link.url for link in lx.extract_links(response)])
Out[7]: 106
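So I get 106 links with /gp/product/, versus 185 with your regex. At least part of the difference comes from links such as /gp/product-reviews/..., which your pattern also matches because it does not require a slash after "product".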
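To illustrate the difference between the two patterns without running a crawl, here is a minimal sketch using only the standard `re` module. It assumes the link extractor applies its `allow` patterns with search semantics, and it uses `/gp/product` as a stand-in for what the question's pattern boils down to once the stray backslashes and the redundant `.*?` are dropped. The helper name `matching` is made up for this example, and the URLs are shortened versions of ones from the session above.

```python
import re

# Stand-in for the question's pattern (assumption: stray backslashes dropped).
LOOSE = re.compile(r"/gp/product")
# The tighter pattern from In [5], which requires the trailing slash.
STRICT = re.compile(r"/gp/product/")

# Shortened sample URLs taken from the shell session output.
urls = [
    "http://www.amazon.com/gp/product/B00GL3MGTI/ref=s9_al_bw_g351_i1/181-5939241-1829655",
    "http://www.amazon.com/gp/product-reviews/B00GL3MGTI/ref=s9_al_bw_rs1/181-5939241-1829655",
    "http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011",
]


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [u for u in candidates if pattern.search(u)]


loose_hits = matching(LOOSE, urls)
strict_hits = matching(STRICT, urls)

# The loose pattern also keeps the product-reviews link; the strict one does not.
print("loose: %d, strict: %d" % (len(loose_hits), len(strict_hits)))
```

If only product pages are wanted, the tighter pattern (or an explicit `deny` for the reviews pages) is probably the safer choice for the rule.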