Prevent Scrapy from following links on certain pages

Date: 2013-08-15 08:48:04

Tags: python scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from aibang.items import OrgItem

class OrgSpider(CrawlSpider):
  name = "org"
  allowed_domains = ["demo-site.com"]
  start_urls = [
      'http://demo-site.com/detail/17507640-419823665'
  ]

  rules = ( 
      # Item List
      Rule(SgmlLinkExtractor(allow=(r'list\/\d+$', ))),
      # Parse item
      Rule(SgmlLinkExtractor(allow=(r'detail\/\d+-\d+$', )), callback='parse_item', follow=False),
  )

  def parse_item(self, response):
    hxs = HtmlXPathSelector(response)

    item = OrgItem()
    try:
      item['name'] = hxs.select('//div[@class="b_title"]/h1/text()')[0].extract()
    except IndexError:
      # No title node on this page; skip it rather than crash below.
      print 'Something went wrong, skip it'
      return
    print item['name']
    return item

I am crawling some pages with Scrapy, but I don't want it to follow the links found on the detail/xxx-xxx pages. How can I disable that?

I added follow=False, but it doesn't work: the spider still follows the links inside detail/xxx-xxx pages.

====== NOTE ======

I still need to crawl list pages linked from a detail page; what I don't want is to crawl detail pages linked from within another detail page.

1 Answer:

Answer 0 (score: 0)

From the SgmlLinkExtractor documentation:

class scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href'), canonicalize=True, unique=True, process_value=None)
  

deny (a regex or list of regexes) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It takes precedence over the allow parameter. If not given (or empty), it will not exclude any links.
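For example, here is a minimal sketch of how a deny pattern could be plugged into the rules from your question (the deny regex below is a hypothetical placeholder; replace it with whatever pattern matches the links you want skipped):

rules = (
    # Item list: follow list pages.
    Rule(SgmlLinkExtractor(allow=(r'list\/\d+$', ))),
    # Parse item: extract detail links, except those matching the deny
    # pattern. deny takes precedence over allow, so matching URLs are
    # never extracted.
    Rule(SgmlLinkExtractor(allow=(r'detail\/\d+-\d+$', ),
                           deny=(r'detail\/unwanted', )),  # hypothetical pattern
         callback='parse_item',
         follow=False),
)

Keep in mind that CrawlSpider runs every rule's link extractor over each followed response, so a deny pattern on an extractor filters those links everywhere in the crawl, not just on one page type.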

I hope this works for you.