My SgmlLinkExtractor does not match my regular expression in Scrapy

Asked: 2014-06-20 05:08:15

Tags: python web-scraping scrapy

Here is my code:

import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

# AmazonScraper is the project's Item class; its import is not shown in the question.


class MySpider(CrawlSpider):
    name = "scraper"
    allowed_domains = ["amazon.com"]
    start_urls = ["http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011"]

    # Follow every link whose URL matches the pattern and send it to parse_items.
    rules = [Rule(SgmlLinkExtractor(allow=('.*?/\gp/\product.*?')), callback='parse_items', follow=True)]

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        item = AmazonScraper()
        print 'inside'
        print sel.css('#btAsinTitle::text').extract()
        item["title"] = ''.join(sel.css('#btAsinTitle::text').extract())
        print '-----', item["title"]
        print response.url

        # Join the text nodes of each price element, then strip all whitespace.
        item["digitalprice"] = ''.join(sel.css('.digitalListPrice>.listprice::text').extract())
        item["digitalprice"] = re.sub(r'\s+', '', item["digitalprice"])
        item["listprice"] = ''.join(sel.css('.listPrice::text').extract())
        item["listprice"] = re.sub(r'\s+', '', item["listprice"])
        item["kindleprice"] = ''.join(sel.css('.priceLarge::text').extract())
        item["kindleprice"] = re.sub(r'\s+', '', item["kindleprice"])

        # ''.join(...) always returns a string, so this check always passes.
        if item["digitalprice"] is not None and item["listprice"] is not None and item["kindleprice"] is not None:
            items.append(item)

        print items

        return items

The URLs I get do not match the regex either. Why is that? I want to crawl all the book links on the seed page.

1 Answer:

Answer 0: (score: 0)

As I suggested in the comments, maybe take another look at your regular expression.
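One thing worth checking in the pattern is the pair of backslashes in \gp and \product. The sketch below uses only the standard re module, not Scrapy, and assumes the link extractor applies allow patterns with search semantics. Under that assumption the leading and trailing .*? do not change which URLs match, so /gp/product should select the same URLs. The variable names are illustrative.

```python
import re

# Same characters as the pattern in the question (backslashes doubled for clarity).
original = ".*?/\\gp/\\product.*?"
# Simplified pattern: no stray escapes, no redundant .*? at either end.
simplified = r"/gp/product"

# One product URL taken from the shell session in this answer.
url = "http://www.amazon.com/gp/product/B00DBYBNEE/ref=nav_prime_join/181-5939241-1829655"

try:
    # Python 2 accepts the pattern and treats \g and \p as plain "g" and "p".
    original_matches = bool(re.search(original, url))
except re.error:
    # Python 3.7+ rejects unknown escapes such as \g and \p.
    original_matches = None

simplified_matches = bool(re.search(simplified, url))
print("original: %r, simplified: %r" % (original_matches, simplified_matches))
```

Dropping the backslashes keeps the pattern valid on both Python 2 and recent Python 3 versions.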

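Here is a rather long scrapy shell session (long because of the number of links, and I skipped some of them). It was run from France, so the response may differ in your part of the world. It seems to pick up quite a lot of product links: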

paul@paul-SATELLITE-R830:~$ scrapy shell "http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011" --set USER_AGENT="Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
2014-06-20 12:58:05+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
...
2014-06-20 12:58:06+0200 [default] INFO: Spider opened
2014-06-20 12:58:08+0200 [default] DEBUG: Crawled (200) <GET http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f6ec6fb4310>
[s]   item       {}
[s]   request    <GET http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011>
[s]   response   <200 http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011>
[s]   sel        <Selector xpath=None data=u'<html>\n    <head>\n        <meta http-equ'>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x7f6ec6740590>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor  
In [2]: lx = SgmlLinkExtractor(allow=('.*?/\gp/\product.*?',))
In [3]: import pprint
In [4]: pprint.pprint([link.url for link in lx.extract_links(response)])
['http://www.amazon.com/gp/product/B00DBYBNEE/ref=gno_joinprmlogo/181-5939241-1829655',
 'http://www.amazon.com/gp/product/B00DBYBNEE/ref=nav_prime_join/181-5939241-1829655',
 'http://www.amazon.com/gp/product/B007HCCNJU/ref=topnav_storetab_kstore/181-5939241-1829655',
 'http://www.amazon.com/gp/product/B00FL3YL7O/ref=amb_link_410918762_2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1775973302&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-top-1&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00GL3MGTI/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product-reviews/B00GL3MGTI/ref=s9_al_bw_rs1/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1',
 'http://www.amazon.com/gp/product/B00HWI5OP4/ref=s9_al_bw_g351_i2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product-reviews/B00HWI5OP4/ref=s9_al_bw_rs2/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1',
 'http://www.amazon.com/gp/product/B009NF6Z2K/ref=s9_al_bw_g351_i3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product-reviews/B009NF6Z2K/ref=s9_al_bw_rs3/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1',
 ...
 'http://www.amazon.com/gp/product-reviews/B00DN7BAUG/ref=s9_hps_bw_rs3/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101&showViewpoints=1',
 'http://www.amazon.com/gp/product/B00A7H2CFW/ref=s9_hps_bw_g351_i4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101',
 'http://www.amazon.com/gp/product-reviews/B00A7H2CFW/ref=s9_hps_bw_rs4/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101&showViewpoints=1',
 'http://www.amazon.com/gp/product/B00B52IQNA/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00B52IQNA/ref=s9_al_bw_g351_t1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00B52IQT4/ref=s9_al_bw_g351_i2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00B52IQT4/ref=s9_al_bw_g351_t2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00B52IQSA/ref=s9_al_bw_g351_i3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00B52IQSA/ref=s9_al_bw_g351_t3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00FL3YL6K/ref=amb_link_410240162_1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00FL3YL6K/ref=amb_link_410240162_3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00DZQE2Y6/ref=amb_link_410240162_4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00C7XTOMS/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711175222&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-8&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00C7XTOMS/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711175222&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-8&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101']

In [5]: lx = SgmlLinkExtractor(allow=('/gp/product/',))

In [6]: pprint.pprint([link.url for link in lx.extract_links(response)])
['http://www.amazon.com/gp/product/B00DBYBNEE/ref=gno_joinprmlogo/181-5939241-1829655',
 'http://www.amazon.com/gp/product/B00DBYBNEE/ref=nav_prime_join/181-5939241-1829655',
 ...
 'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101',
 'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101']

In [7]: len([link.url for link in lx.extract_links(response)])
Out[7]: 106

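So I get 106 links with /gp/product/, versus 185 with your regular expression, which also matches URLs such as the /gp/product-reviews/ ones in the first listing.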
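The difference between the two patterns can be reproduced with the standard re module alone. This is a minimal sketch, assuming the link extractor tests each URL with a regex search. The sample URLs are copied from the session above with their query strings trimmed, and the names loose and strict are illustrative.

```python
import re

# Sample URLs from the shell output (query strings trimmed for brevity).
urls = [
    "http://www.amazon.com/gp/product/B00GL3MGTI/ref=s9_al_bw_g351_i1/181-5939241-1829655",
    "http://www.amazon.com/gp/product-reviews/B00GL3MGTI/ref=s9_al_bw_rs1/181-5939241-1829655",
    "http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011",
]

# Equivalent to the question's pattern once the stray backslashes are removed.
loose = re.compile(r".*?/gp/product.*?")
# The trailing slash excludes /gp/product-reviews/ URLs.
strict = re.compile(r"/gp/product/")

loose_hits = [u for u in urls if loose.search(u)]
strict_hits = [u for u in urls if strict.search(u)]

print("loose: %d, strict: %d" % (len(loose_hits), len(strict_hits)))
```

The loose pattern keeps both the product page and its reviews page, while the strict one keeps only the product page.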