Scrapy is not honoring the deny rules

Time: 2013-12-11 01:06:31

Tags: python web-crawler scrapy

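For some reason, Scrapy is parsing data from URLs that match my deny rules: I am getting parsed data from URLs containing /browse/, /search/, and /ip/. I'm not sure where this is going wrong.

Please advise, thanks! My code is below: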

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from wallspider.items import Website


class mydomainSpider(CrawlSpider):
    name = "tp"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com",]

    """/tp/ page type to crawl"""

    rules = (Rule (SgmlLinkExtractor(allow=('/tp/', ),
        deny=(
            'browse/',
            'browse-ng.do?',
            'search-ng.do?',
            'facet=',
            'ip/',
            'page/',
            'search/',
            '/[1-9]$',
            '(bti=)[1-9]+(?:\.[1-9]*)?',
            '(sort_by=)[a-zA-Z]',
            '(sort_by=)[1-9]+(?:\.[1-9]*)?',
            '(ic=32_)[1-9]+(?:\.[1-9]*)?',
            '(ic=60_)[0-9]+(?:\.[0-9]*)?',
            '(search_sort=)[1-9]+(?:\.[1-9]*)?', )
            ,)
    , callback="parse_items", follow= True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        for site in sites:
            item = Website()
            item['referer'] = response.request.headers.get('Referer')
            item['url'] = response.url
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['description'] = site.select('//meta[@name="Description"]/@content').extract()
            items.append(item)

        return items

Part of my console log, showing it grabbing /ip/ pages:

2013-12-11 11:21:43-0800 [tp] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/1104329> (referer: http://www.mydomain.com/tp/john-duigan)
2013-12-11 11:21:43-0800 [tp] DEBUG: Scraped from <200 http://www.mydomain.com/ip/1104329>
    {'description': [u'Shop Low Prices on: Molly (Widescreen) : Movies'],
     'referer': 'http://www.mydomain.com/tp/john-duigan',
     'title': [u'Molly (Widescreen): Movies : mydomain.com '],
     'url': 'http://www.mydomain.com/ip/1104329'}
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/jon-furmanski>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/taylor-byrd>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/greg-byers>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/tom-bowker>
2013-12-11 11:21:43-0800 [tp] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/21152221> (referer: http://www.mydomain.com/tp/peter-levin)
2013-12-11 11:21:43-0800 [tp] DEBUG: Scraped from <200 http://www.mydomain.com/ip/21152221>
    {'description': [u'Shop Low Prices on: Marva Collins Story (1981) : Video on Demand by VUDU'],
     'referer': 'http://www.mydomain.com/tp/peter-levin',
     'title': [u'Marva Collins Story (1981): Video on Demand by VUDU : mydomain.com '],
     'url': 'http://www.mydomain.com/ip/21152221'}

2 Answers:

Answer 0: (score: 4)

The rules of SgmlLinkExtractor are applied when links are extracted from a page. In your case, some of your .../tp/... requests are being redirected to .../ip/... pages.

Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/tom-bowker>

The allow and deny patterns are not applied to URLs after a redirect.
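To make this concrete, here is a minimal sketch that imitates the allow/deny check with plain `re.search`, on the assumption that the link extractor tests each pattern against the URL found in the page. The `link_passes` helper and the shortened pattern lists are hypothetical, written only for illustration; they are not Scrapy's own code.

```python
import re

# Hypothetical subset of the patterns from the question's spider.
ALLOW = [r"/tp/"]
DENY = [r"browse/", r"search/", r"ip/"]


def link_passes(url, allow=ALLOW, deny=DENY):
    """Return True if the URL matches an allow pattern and no deny pattern."""
    allowed = any(re.search(pattern, url) for pattern in allow)
    denied = any(re.search(pattern, url) for pattern in deny)
    return allowed and not denied


# The URL that appears in the page's HTML, and therefore the one that is filtered.
extracted = "http://www.mydomain.com/tp/tom-bowker"
# The URL the server redirects to afterwards; it is never run through the filter.
redirected = "http://www.mydomain.com/ip/17371019"

print(link_passes(extracted))   # the /tp/ link is accepted
print(link_passes(redirected))  # this URL would have been rejected, had it been checked
```

The extracted /tp/ link passes the filter, so the request is scheduled; the 302 response then sends the crawler to an /ip/ URL that no rule ever inspects.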

You can disable following redirects entirely by setting REDIRECT_ENABLED to False (see RedirectMiddleware).
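If turning redirects off for the whole project is too blunt, one possible alternative is to re-check the final URL inside the callback and skip pages that match a deny pattern. The sketch below is only an illustration under that assumption: `is_denied` and `parse_items_guarded` are hypothetical names, the pattern list is a shortened copy of the one in the question, and the function takes a plain URL string in place of a real Scrapy response so that it runs on its own.

```python
import re

# Hypothetical helper, not part of Scrapy: a few deny patterns from the spider, compiled once.
DENY_PATTERNS = [re.compile(p) for p in (r"browse/", r"search/", r"ip/")]


def is_denied(url):
    """Return True if the URL matches any of the deny patterns."""
    return any(pattern.search(url) for pattern in DENY_PATTERNS)


def parse_items_guarded(response_url):
    """Stand-in for the spider callback, taking the final (post-redirect) URL.

    In the real spider this check would sit at the top of parse_items
    and look at response.url.
    """
    if is_denied(response_url):
        # The request was redirected to a page type we do not want; scrape nothing.
        return []
    return [{"url": response_url}]


print(parse_items_guarded("http://www.mydomain.com/ip/17371019"))
print(parse_items_guarded("http://www.mydomain.com/tp/john-duigan"))
```

The trade-off is that the redirected page is still downloaded before it is discarded, whereas setting `REDIRECT_ENABLED = False` in the project settings stops the redirect from being followed at all.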

Answer 1: (score: 0)

I found the error: the pages were redirecting to a page type that is in my deny rules. Thanks for your help! I appreciate it!