I wrote a web scraper that extracts absolute links from a start URL and keeps visiting absolute links within the domain until it is stopped. Scrapy automatically skips duplicate links. The crawler works.
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test_spider'
    start_urls = ['http://homeguide.ph/']
    allowed_domains = ['homeguide.ph']

    # for the initial visit
    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            if "#" not in link:
                yield scrapy.Request(link, callback=self.parse_link)

    # for subsequent visits
    def parse_link(self, response):
        self.logger.info("Visited %s", response.url)
        links = response.xpath('//a/@href').extract()
        for link in links:
            if "#" not in link:  # skip links containing a fragment (#)
                yield scrapy.Request(link, callback=self.parse_link)
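One detail worth noting: the `"#" not in link` check drops every URL that contains a fragment, even when the part before the `#` is a perfectly crawlable page, and it does not actually guarantee the link is absolute. A possible refinement (a stdlib-only sketch, not Scrapy-specific; the helper name `normalize_links` is my own) is to resolve each href against the page URL and strip the fragment instead:

```python
from urllib.parse import urljoin, urldefrag

def normalize_links(base_url, hrefs):
    """Resolve relative hrefs against base_url and strip #fragments,
    so e.g. '/about#team' still yields a crawlable page URL.
    Deduplicates while preserving first-seen order."""
    seen = set()
    out = []
    for href in hrefs:
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url and url not in seen:
            seen.add(url)
            out.append(url)
    return out

# Hypothetical example:
# normalize_links("http://homeguide.ph/", ["/about#team", "contact"])
# → ["http://homeguide.ph/about", "http://homeguide.ph/contact"]
```

Inside the spider, the same effect can be had with `response.follow(href, callback=...)`, which resolves relative URLs for you; Scrapy's built-in duplicate filter then handles the deduplication.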
However, I feel it could be improved, and I'm not sure how. Is there a way to improve this crawler?