Question

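I've been trying to get Scrapy's LinkExtractor to work, without success. I want it to find any link and then call a different method that just prints something, to show that it is working.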

This is my spider:

from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class TestSpider(CrawlSpider):
    name = 'spi'
    allowed_domains = ['https://www.reddit.com/']
    start_urls = ['https://www.reddit.com/']

    rules = [
        Rule(LinkExtractor(allow=()),
             callback='detail', follow=True)
    ]

    def parse(self, response):
        print("parsed!")

    def detail(self, response):
        print('parsed detail!')

When I run the spider with the command "scrapy crawl spi", I only get "parsed!", so it only goes into the parse function and never into the detail method.

Answer 1

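If you are using the CrawlSpider base class for your spider, avoid defining a parse method, because it will break the rule processing. Read the warning in the documentation.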

Answer 2

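There's no need to comment out parse; just rename it to the default parse_item, or whatever you like! The point is that parse is a method CrawlSpider already uses for its own logic.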

In the future, instead of "... genspider etc.", try "scrapy genspider -t crawl SPIDERNAME BASEURL" (BASEURL without http/s://www...., i.e. site.com).

Scrapy LinkExtractor or Rule not working
