How do I restrict the region where a LinkExtractor is applied?

Asked: 2015-05-06 10:16:41

Tags: scrapy

I have a scraper with the following rules:

rules = (
  Rule(LinkExtractor(allow=(r'\S+list=\S+',))),
  Rule(LinkExtractor(allow=(r'\S+list=\S+',))),
  Rule(LinkExtractor(allow=(r'\S+view=1\S+',)), callback='parse_archive'),
)

As you can see, the first and second rules are completely identical.

What I want to do is tell the spider which links we are interested in by pointing it only at specific parts of the page. For convenience I am including the corresponding XPaths below, although I would prefer a solution based on BeautifulSoup syntax.

//*[@id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/table/tbody/tr/td[1]

//*[@id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/form/table/tbody/tr[1]

//*[@id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/form/table/tbody/tr[2]

Edit:

Let me give an example. Suppose I want to extract five (of the six) links on Scrapy's official page:

[screenshot: the six navigation links on scrapy.org]

This is my spider. Any ideas?

class dmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["scrapy.org"]
    start_urls = [
        "http://scrapy.org/",
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'\S+/',),
                           restrict_xpaths='/html/body/div[1]/div/ul'),
             callback='first_level'),
    )

    def first_level(self, response):
        taco = dmozItem()
        taco['basic_url'] = response.url
        return taco

1 Answer:

Answer 0 (score: 2)

This can be done with the restrict_xpaths argument. See the LxmlLinkExtractor documentation.
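As a toy illustration of the idea (standard library only, not Scrapy itself; the `<ul class="navigation">` markup below is made up for the example), restricting link extraction to a single container looks like this:

```python
from html.parser import HTMLParser

class NavLinkCollector(HTMLParser):
    """Collect hrefs only while inside <ul class="navigation">."""
    def __init__(self):
        super().__init__()
        self.in_nav = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'ul' and attrs.get('class') == 'navigation':
            self.in_nav = True          # entered the restricted region
        elif self.in_nav and tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'ul':
            self.in_nav = False         # left the restricted region

html = ('<ul class="navigation"><li><a href="/download/">Download</a></li>'
        '<li><a href="/doc/">Docs</a></li></ul>'
        '<p><a href="/elsewhere/">Ignored</a></p>')

parser = NavLinkCollector()
parser.feed(html)
print(parser.links)  # ['/download/', '/doc/']
```

restrict_xpaths does the same kind of scoping for you, except the region is selected by an XPath expression instead of hand-written parser state.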

Edit:

You can also pass a list to restrict_xpaths.

编辑2:

A complete example that should work:

import scrapy
# Note: the scrapy.contrib paths match the Scrapy versions current in 2015;
# in modern Scrapy these live under scrapy.spiders and scrapy.linkextractors.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class dmozItem(scrapy.Item):
    basic_url = scrapy.Field()


class dmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["scrapy.org"]
    start_urls = [
        "http://scrapy.org/",
    ]

    def clean_url(value):
        # Collapse relative "/../" segments in the extracted hrefs
        return value.replace('/../', '/')

    rules = (
        Rule(
            LinkExtractor(
                allow=(r'\S+/',),
                restrict_xpaths=['.//ul[@class="navigation"]/a[1]',
                                 './/ul[@class="navigation"]/a[2]',
                                 './/ul[@class="navigation"]/a[3]',
                                 './/ul[@class="navigation"]/a[4]',
                                 './/ul[@class="navigation"]/a[5]'],
                process_value=clean_url,
            ),
            callback='first_level'),
    )

    def first_level(self, response):
        taco = dmozItem()
        taco['basic_url'] = response.url
        return taco