Scrapy: adding an arbitrary URL to SgmlLinkExtractor

Posted: 2011-11-20 15:09:04

Tags: python scrapy scrape

How do I add a URL to SgmlLinkExtractor? That is, how can I add an arbitrary URL so that my callback runs on it?

To elaborate, using dirbot as an example: https://github.com/scrapy/dirbot/blob/master/dirbot/spiders/googledir.py

parse_category only visits pages matched by the SgmlLinkExtractor rule (allow='directory.google.com/[A-Z][a-zA-Z_/]+$').
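For reference, the relevant part of that spider looks roughly like this (a minimal sketch reconstructed from the description above, using the old scrapy.contrib import paths that were current at the time; the actual dirbot file may differ in detail):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class GoogledirSpider(CrawlSpider):
        name = "googledir"
        allowed_domains = ["directory.google.com"]
        start_urls = ["http://directory.google.com/"]

        # Only links matching this pattern are followed and handed to parse_category;
        # URLs that do not match the rule never reach the callback.
        rules = (
            Rule(SgmlLinkExtractor(allow='directory.google.com/[A-Z][a-zA-Z_/]+$'),
                 'parse_category', follow=True),
        )

        def parse_category(self, response):
            # scrape the category page here
            ...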

2 Answers:

Answer 0 (score: 0):

Use BaseSpider instead of CrawlSpider, then add the URLs either in start_requests or in start_urls[]:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "myspider"

    def start_requests(self):
        # Issue an arbitrary request and route it to whatever callback you want
        return [Request("https://www.example.com",
            callback=self.parse)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ...
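
The trade-off behind this answer: with BaseSpider you schedule every request yourself, so any URL can be sent to any callback, whereas a CrawlSpider only passes to its callbacks the links that its SgmlLinkExtractor rules actually match.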

Answer 1 (score: 0):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ThemenHubSpider(CrawlSpider):
    name = 'themenHub'
    allowed_domains = ['themen.t-online.de']
    start_urls = ["http://themen.t-online.de/themen-a-z/a"]
    # Links whose URL matches id_<digits> are followed and handed to parse_news
    rules = [Rule(SgmlLinkExtractor(allow=['id_\d+']), 'parse_news')]
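
If you want a CrawlSpider like this one to also process arbitrary URLs (the original question), one option not shown in this answer is to list them in start_urls and override parse_start_url, the hook CrawlSpider calls for each start URL. A minimal sketch, assuming the same parse_news callback should handle those pages; the second start URL is a made-up placeholder:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    class ThemenHubSpider(CrawlSpider):
        name = 'themenHub'
        allowed_domains = ['themen.t-online.de']
        # Arbitrary URLs can be appended here; the rules still apply to links found on them.
        # The second entry is a hypothetical example, not a real page.
        start_urls = ["http://themen.t-online.de/themen-a-z/a",
                      "http://themen.t-online.de/some-arbitrary-page"]
        rules = [Rule(SgmlLinkExtractor(allow=['id_\d+']), 'parse_news')]

        def parse_start_url(self, response):
            # CrawlSpider calls this for every start URL; delegate to the same callback
            return self.parse_news(response)

        def parse_news(self, response):
            hxs = HtmlXPathSelector(response)
            # extract items here
            ...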