How do I set Scrapy rules to only parse /browse/ pages?

Asked: 2014-01-29 02:17:50

Tags: python web-crawler scrapy

I'm trying to crawl a site and parse only the /browse/ pages, but Scrapy appears to be parsing other page types as well, such as /ip/. I've copied my code and console log below. I think the problem is with my rules: basically I want to crawl the entire site, every page type, but only parse URLs that contain /browse/.
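To make the intent concrete, this is roughly the rule layout I have in mind (just a sketch with a placeholder spider name, not the code I'm actually running, which is further down):

# Sketch only: follow links everywhere, but only hand /browse/ pages to the callback.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class BrowseOnlySpider(CrawlSpider):
    name = "browseonly"          # placeholder name, just for illustration
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://www.mydomain.com/"]

    rules = (
        # /browse/ links: follow them AND send the response to the callback.
        # This rule is listed first because CrawlSpider uses the first rule
        # that matches a given link.
        Rule(SgmlLinkExtractor(allow=('/browse/',)),
             callback='parse_links', follow=True),
        # every other link (/ip/ etc.): follow only, no callback.
        Rule(SgmlLinkExtractor(), follow=True),
    )

    def parse_links(self, response):
        # the real parsing lives in my actual spider below; this is just a stub
        pass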

Here is my log:

2014-01-28 18:11:40-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/Danksin-Now-Women-s-Maternity-Microfleece-Hoodie/27582877>
    {'canonical': [u'http://www.mydomain.com/ip/Danksin-Now-Women-s-Maternity-Microfleece-Hoodie/27582877'],
     'class_text': '',
     'referer': 'http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6',
     'title': [u"Danksin Now Women's Maternity Microfleece Hoodie: Maternity : mydomain.com "],
     'url': 'http://www.mydomain.com/ip/Danksin-Now-Women-s-Maternity-Microfleece-Hoodie/27582877'}
2014-01-28 18:11:40-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/FAST-TRACK-Loving-Moments-by-Leading-Lady-Maternity-Adjustable-Legging-Wear-From-Maternity-Back-To-Your-Regular-Size/29740820>
    {'canonical': [u'http://www.mydomain.com/ip/FAST-TRACK-Loving-Moments-by-Leading-Lady-Maternity-Adjustable-Legging-Wear-From-Maternity-Back-To-Your-Regular-Size/29740820'],
     'class_text': '',
     'referer': 'http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6',
     'title': [u'Loving Moments by Leading Lady Maternity Adjustable Legging: Maternity : mydomain.com '],
     'url': 'http://www.mydomain.com/ip/FAST-TRACK-Loving-Moments-by-Leading-Lady-Maternity-Adjustable-Legging-Wear-From-Maternity-Back-To-Your-Regular-Size/29740820'}
2014-01-28 18:11:40-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/Danskin-Now-Maternity-Microfleece-Pants-2-Pack-Value-Bundle/31022736> (referer: http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6)
2014-01-28 18:11:40-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/Danskin-Now-Maternity-Performance-Jacket/32360420> (referer: http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6)
2014-01-28 18:11:40-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/Danskin-Now-Maternity-Microfleece-Pants-2-Pack-Value-Bundle/31022736>
    {'canonical': [u'http://www.mydomain.com/ip/Danskin-Now-Maternity-Microfleece-Pants-2-Pack-Value-Bundle/31022736'],
     'class_text': '',
     'referer': 'http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6',
     'title': [u'Danskin Now Maternity Microfleece Pants, 2-Pack Value Bundle: Maternity : mydomain.com '],
     'url': 'http://www.mydomain.com/ip/Danskin-Now-Maternity-Microfleece-Pants-2-Pack-Value-Bundle/31022736'}

Here is my code:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from wallspider.items import Website
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class Myspider(CrawlSpider):
    name = "newbrowsepages"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://www.mydomain.com/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('/browse/',)),
             callback='parse_links', follow=True,
             process_links=lambda links: [link for link in links if not link.nofollow]),
        Rule(SgmlLinkExtractor(allow=('/browse/',),
                               deny=('/[1-9]$', '(bti=)[1-9]+(?:\.[1-9]*)?',
                                     '(sort_by=)[a-zA-Z]', '(sort_by=)[1-9]+(?:\.[1-9]*)?',
                                     '(ic=32_)[1-9]+(?:\.[1-9]*)?', '(ic=60_)[0-9]+(?:\.[0-9]*)?',
                                     '(search_sort=)[1-9]+(?:\.[1-9]*)?', 'browse-ng.do\?',
                                     '/page/', '/ip/', 'out\+value', 'fn=',
                                     'customer_rating', 'special_offers',
                                     'search_sort=&', 'facet='))),
    )

    def parse_start_url(self, response):
        # return the requests generated for the links on the start page
        return list(self.parse_links(response))

    def parse_links(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a')
        domain = 'http://www.mydomain.com'
        for link in links:
            class_text = ''.join(link.select('./@class').extract())
            title = ''.join(link.select('./@title').extract())
            url = ''.join(link.select('./@href').extract())
            # pass both values along in a single meta dict
            meta = {'title': title, 'class_text': class_text}
            yield Request(domain+url, callback=self.parse_page, meta=meta)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        item = Website()
        for site in sites:
            item['class_text']=response.meta['class_text']
            item['url'] = response.url
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['referer'] = response.request.headers.get('Referer')
            item['canonical'] = site.xpath('//head/link[@rel="canonical"]/@href').extract()

        return item
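For reference, the Website item imported from wallspider.items isn't shown above; presumably it just declares the five fields that parse_page fills in, roughly:

# Assumed definition in wallspider/items.py (not included in the question);
# the field names are inferred from what parse_page assigns.
from scrapy.item import Item, Field

class Website(Item):
    class_text = Field()
    url = Field()
    title = Field()
    referer = Field()
    canonical = Field()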

Updated log:

2014-01-28 18:44:21-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/Dorel-Twin-Over-Full-Metal-Black-Bunk-Bed-with-Optional-Mattresses/20690436>
    {'canonical': [u'http://www.mydomain.com/ip/Dorel-Twin-Over-Full-Metal-Black-Bunk-Bed-with-Optional-Mattresses/20690436'],
     'class_text': '',
     'referer': 'http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102',
     'title': [u"Dorel Twin-Over-Full Metal Black Bunk Bed with Optional Mattresses: Kids' & Teen Rooms : mydomain.com "],
     'url': 'http://www.mydomain.com/ip/Dorel-Twin-Over-Full-Metal-Black-Bunk-Bed-with-Optional-Mattresses/20690436'}
2014-01-28 18:44:21-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/Mainstays-Twin-over-Twin-Wood-Bunk-Bed-Multiple-Finishes/20563913>
    {'canonical': [u'http://www.mydomain.com/ip/Mainstays-Twin-over-Twin-Wood-Bunk-Bed-Multiple-Finishes/20563913'],
     'class_text': '',
     'referer': 'http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102',
     'title': [u'Shop for the Mainstays Twin Over Twin Wood Bunk Bed at mydomain.com. Save money. Live better.'],
     'url': 'http://www.mydomain.com/ip/Mainstays-Twin-over-Twin-Wood-Bunk-Bed-Multiple-Finishes/20563913'}
2014-01-28 18:44:21-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/Office-Task-Chair-with-Arms-Black/13007418>
    {'canonical': [u'http://www.mydomain.com/ip/Office-Task-Chair-with-Arms-Black/13007418'],
     'class_text': '',
     'referer': 'http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102',
     'title': [u'Student Task Chair with Arms - mydomain.com'],
     'url': 'http://www.mydomain.com/ip/Office-Task-Chair-with-Arms-Black/13007418'}
2014-01-28 18:44:22-0800 [newbrowsepages] INFO: Crawled 92 pages (at 92 pages/min), scraped 11 items (at 11 items/min)
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6> (referer: http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L127&search_sort=6)
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/apparel/boys-shoes/5438_1045804_1045805_624079> (referer: http://www.mydomain.com/browse/apparel/baby-kids/5438_1045804_1045805/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L132)
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/jewelry/jewelry-storage/3891_132987/?amp;ic=48_0&amp;ref=125876.183604&amp;tab_value=10874_All&catNavId=3891&povid=P1171-C1110.2784+1455.2776+1115.2956-L147> (referer: http://www.mydomain.com/)
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/apparel/girls-shoes/5438_1045804_1045805_605881> (referer: http://www.mydomain.com/browse/apparel/baby-kids/5438_1045804_1045805/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L132)
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/apparel/baby-toddler-shoes/5438_1045804_1045805_587407> (referer: http://www.mydomain.com/browse/apparel/baby-kids/5438_1045804_1045805/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L132)
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/South-Shore-Smart-Basics-3-Drawer-Chest-Chocolate/12480393> (referer: http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102)
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/South-Shore-Country-Double-Dresser-Cream/3921886> (referer: http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102)
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/Mainstays-Twin-Platform-Bed-with-Headboard-Cinnamon-Cherry/23735992> (referer: http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102)
2014-01-28 18:44:23-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/apparel/handbags/5438_1045799_1045800_163873/?_refineresult=true&povid=cat661959-env498314-moduleB020613-lLinkPOV2_Handbags> (referer: http://www.mydomain.com/browse/apparel/bags/5438_1045799/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L135)
2014-01-28 18:44:23-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/baby/cribs/5427_414099_1101429> (referer: http://www.mydomain.com/browse/baby/5427/?_refineresult=true&facet=customer_rating%3A4+-+5+Stars&povid=P1171-C1110.2784+1455.2776+1115.2956-L148)
2014-01-28 18:44:23-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/Charleston-Storage-Loft-Bed-with-Desk-White/12338217> (referer: http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102)

1 answer:

Answer 0 (score: 1)

What if you merge the 2 rules into one?

rules = (
    Rule(
        SgmlLinkExtractor(
            allow=('/browse/', ),
            deny=('/[1-9]$',
                  '(bti=)[1-9]+(?:\.[1-9]*)?',
                  '(sort_by=)[a-zA-Z]',
                  '(sort_by=)[1-9]+(?:\.[1-9]*)?',
                  '(ic=32_)[1-9]+(?:\.[1-9]*)?',
                  '(ic=60_)[0-9]+(?:\.[0-9]*)?',
                  '(search_sort=)[1-9]+(?:\.[1-9]*)?',
                  'browse-ng.do\?',
                  '/page/',
                  '/ip/',
                  'out\+value',
                  'fn=',
                  'customer_rating',
                  'special_offers',
                  'search_sort=&',
                  'facet='),
        ),
        follow=True,
        process_links=lambda links: [
            link for link in links if not link.nofollow],
        callback='parse_page'),
)
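If that works, one way to wire it into your existing spider is to let the merged rule hand the filtered /browse/ responses straight to parse_page, so the parse_links indirection isn't needed for rule-extracted links. A sketch, with the long deny tuple abbreviated (the patterns are the same ones listed above):

# Sketch of the spider with the merged rule; parse_page is reused unchanged
# and the deny tuple is shortened here for readability.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

DENY_PATTERNS = (
    '/ip/', '/page/',   # ...plus the rest of the deny patterns listed above
)

class Myspider(CrawlSpider):
    name = "newbrowsepages"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://www.mydomain.com/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('/browse/',), deny=DENY_PATTERNS),
             follow=True,
             process_links=lambda links: [link for link in links if not link.nofollow],
             callback='parse_page'),
    )

    def parse_page(self, response):
        # same implementation as in the question
        pass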