Question

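I am building a spider with Scrapy. It works when I don't implement any rules, but now I am trying to add a rule that picks up the paginator and scrapes all the remaining pages, and I can't figure out why it doesn't work.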

Spider code:

    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']

    rules = (
        Rule(SgmlLinkExtractor(allow=("index.php?pg=search&from=10&q=*:*&nr=10"),
                               restrict_xpaths=("//div[@class='paginador']",)),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        ...

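I also tried setting just "index.php" in the rule's allow parameter, but that doesn't work either.

I read in the Scrapy group that I shouldn't put "a/" or "a/@href" in the XPath, because SgmlLinkExtractor searches for the links automatically.

The console output looks like the spider runs fine, but nothing gets scraped.

Any ideas?

Thanks in advance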

EDIT

Using this code:

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from bcncat.items import BcncatItem
import re

class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']

    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(re.escape("index.php")),
                restrict_xpaths=("//div[@class='paginador']")),
            callback="parse_item",
            follow=True),
    )

    def parse_item(self, response):
        self.log("parse_item")
        sel = Selector(response)
        i = BcncatItem()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i

Answer 1

The allow parameter of SgmlLinkExtractor is a (list of) regular expression(s). So "?", "*" and "." are treated as special characters.

You can use allow=(re.escape("index.php?pg=search&from=10&q=*:*&nr=10")) (with import re somewhere near the beginning of your script).
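To make the point about special characters concrete, here is a minimal sketch using only the standard re module. It assumes the pattern is applied to the full URL with search semantics; the absolute URL is built from the domain and query string in the question.

```python
import re

# The query string from the question, treated here as plain text.
literal = "index.php?pg=search&from=10&q=*:*&nr=10"
url = "http://guia.bcn.cat/" + literal

# Used directly as a pattern: "." matches any character, "?" makes the
# preceding "p" optional, and "*" repeats the preceding character.
# As a result the pattern no longer describes the literal text.
unescaped_match = re.search(literal, url)

# re.escape() backslash-escapes the special characters so the pattern
# matches the text literally.
escaped_match = re.search(re.escape(literal), url)

print(unescaped_match is not None)  # False
print(escaped_match is not None)    # True
```

The unescaped pattern fails because "php?pg" is read as "ph", an optional "p", then "pg", which never lines up with the literal "?" in the URL.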

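EDIT: in fact, the above rule doesn't work. But since you already have a restricted region from which you want to extract links, you can use allow=('index.php').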
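One way to see the difference between a fully escaped URL and a looser pattern is to run both against a few paginator links. The hrefs below are hypothetical (the real page's links may differ), and this is only an illustration of the matching behavior, not a diagnosis of why the original rule failed.

```python
import re

# Hypothetical hrefs, as might appear inside the paginator div.
paginator_links = [
    "http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10",
    "http://guia.bcn.cat/index.php?pg=search&from=20&q=*:*&nr=10",
    "http://guia.bcn.cat/index.php?pg=search&from=30&q=*:*&nr=10",
]

# Strict: the whole query string, escaped, pins a single page (from=10).
strict = re.compile(re.escape("index.php?pg=search&from=10&q=*:*&nr=10"))

# Loose: only the script name; the XPath restriction is assumed to
# narrow the candidate links to the paginator already.
loose = re.compile(re.escape("index.php"))

strict_hits = [u for u in paginator_links if strict.search(u)]
loose_hits = [u for u in paginator_links if loose.search(u)]

print(len(strict_hits))  # 1
print(len(loose_hits))   # 3
```

Under these assumptions, the strict pattern would follow only one pagination link, while the loose pattern leaves the selection to restrict_xpaths.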

I can't get data using rules in Scrapy
