I'm using Scrapy to crawl an entire website (allowed_domains = mydomain.com). Now I want to collect all external links (pointing to other domains) from each crawled URL. How can I integrate this into spider.py so that I get a list of all external URLs?
Answer 0 (score: 1)
Try using Link Extractors. Here is an example:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'my-domain.com'
    allowed_domains = ['my-domain.com']
    start_urls = ['http://www.my-domain.com']

    # Extract every link on each crawled page; follow=False means the
    # extracted links themselves are not crawled any further.
    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=False),)

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url
        return item
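Note that the rule above yields every crawled URL, not just external ones. One way to keep only links that leave your domain is to filter by hostname. The sketch below is a standard-library-only illustration of that filtering step (the `is_external` helper and the sample link list are assumptions for the example, not part of Scrapy's API); it treats subdomains such as www.my-domain.com as internal:

```python
from urllib.parse import urlparse

def is_external(url, allowed_domain):
    """Return True if url points outside allowed_domain (subdomains count as internal)."""
    host = urlparse(url).netloc.lower()
    allowed = allowed_domain.lower()
    return not (host == allowed or host.endswith('.' + allowed))

# Sample links as they might come out of a link extractor:
links = [
    'http://www.my-domain.com/about',
    'http://partner.example.org/page',
    'https://my-domain.com/contact',
]
external = [u for u in links if is_external(u, 'my-domain.com')]
# external now holds only http://partner.example.org/page
```

In a spider callback you could apply the same check to the URLs a link extractor returns for `response`, collecting only the external ones into your items. Newer Scrapy versions also let the link extractor do this for you via `LinkExtractor(deny_domains=['my-domain.com'])` from `scrapy.linkextractors`.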