Get all external links of a URL with Scrapy

Posted: 2014-12-23 09:17:33

Tags: hyperlink scrapy external

I am using Scrapy to spider an entire website (allowed_domains = mydomain.com). Now I want to collect all external links (links pointing to other domains) from the current URL. How can I integrate this into spider.py so that I end up with a list of all external URLs?

1 Answer:

Answer 0 (score: 1)

Try using Link Extractors. Here is an example:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = 'my-domain.com'
    allowed_domains = ['my-domain.com']
    start_urls = ['http://www.my-domain.com']

    # follow=False: only links extracted from the start pages are visited;
    # links found on those pages are not followed any further.
    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=False), )

    def parse_url(self, response):
        # allowed_domains filters outgoing requests, so this callback only
        # receives pages on my-domain.com and records each page's URL.
        item = MyItem()
        item['url'] = response.url
        return item
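
Note that the spider above only records the URLs of pages crawled inside the allowed domain; it does not by itself collect links pointing to other domains. One way to get the external links is to run a second link extractor inside the callback with deny_domains set to your own domain, so that only off-site links are returned. Here is a minimal sketch under that assumption (it uses the modern scrapy.linkextractors and scrapy.spiders module paths, which replaced the deprecated scrapy.contrib ones in later Scrapy releases; the spider name, domain, and item class are placeholders):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class ExternalUrlItem(Item):
    url = Field()


class ExternalLinkSpider(CrawlSpider):
    name = 'external-links'
    allowed_domains = ['my-domain.com']
    start_urls = ['http://www.my-domain.com']

    # Follow internal links so the whole site gets visited.
    rules = (Rule(LinkExtractor(), callback='parse_page', follow=True),)

    # Second extractor that only matches links leaving our domain.
    external_extractor = LinkExtractor(deny_domains=['my-domain.com'])

    def parse_page(self, response):
        # Yield one item per external link found on this page.
        for link in self.external_extractor.extract_links(response):
            item = ExternalUrlItem()
            item['url'] = link.url
            yield item

Running it with something like scrapy crawl external-links -o external.json should then write every external URL found on the site to external.json.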