I'm trying to understand how LinkExtractor works in Scrapy. What I want to accomplish:
Follow the pagination on the start page
Search the URLs and scan all links matching a pattern
On the pages those links lead to, follow the other links on that page that match the pattern and scrape that page
My code:
import scrapy
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ToScrapeMyspider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["myspider.com"]
    start_urls = ["www.myspider.com/category.php?k=766"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//link[@rel="next"]/a'), follow=True),
        Rule(LinkExtractor(allow=r"/product.php?p=\d+$"), callback='parse_spider')
    )

    def parse_spider(self, response):
        Request(allow=r"/product.php?e=\d+$", callback=self.parse_spider2)

    def parse_spider2(self, response):
        # EXTRACT AND PARSE DATA HERE ETC (IS WORKING)
        pass
My pagination link looks like this:
<link rel="next" href="https://myspider.com/category.php?k=766&amp;s=100" >
First I get the error 'str' object has no attribute 'iter' from restrict_xpaths, but I guess I've messed something up.
Answer 0 (score: 1)
Finally got it working:
rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@rel="next"]',)), follow=True),
    Rule(LinkExtractor(allow=(r'product\.php',)), callback='parse_spider'),
)

BASE_URL = 'https://myspider.com/'

def parse_spider(self, response):
    links = response.xpath('//li[@id="id"]/a/@href').extract()
    for link in links:
        absolute_url = self.BASE_URL + link
        yield scrapy.Request(absolute_url, callback=self.parse_spider2)
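Why this works: LinkExtractor only scans <a> and <area> tags by default, and a <link> element is self-closing with no <a> child, so the original XPath //link[@rel="next"]/a could never match anything; restricting to the <a rel="next"> anchors puts the rule back on tags the extractor actually reads. Note also that in the question's pattern /product.php?p=\d+$ the unescaped ? is a regex quantifier, which is why the working rule escapes the dot and drops the query string. If a page only exposed pagination through the <link rel="next"> element in <head>, something along these lines should also work (a hedged, untested sketch; tags and attrs are LinkExtractor's standard arguments for changing which elements it scans):

Rule(
    LinkExtractor(
        tags=('link',),    # scan <link> elements instead of the default <a>/<area>
        attrs=('href',),   # pull the URL from the href attribute
        restrict_xpaths='//link[@rel="next"]',
    ),
    follow=True,
)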
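Two more notes on the original attempt: scrapy.Request takes a URL as its first argument, not an allow pattern, and the request was never yielded, so the question's parse_spider could not have produced follow-up requests. Hand-building absolute URLs with a BASE_URL prefix works, but response.urljoin resolves each href against the page's own URL, whether the href is relative or absolute. A minimal variant of the same method, assuming the same XPath:

def parse_spider(self, response):
    for link in response.xpath('//li[@id="id"]/a/@href').extract():
        # urljoin resolves the href against the current page URL,
        # so no hand-maintained BASE_URL prefix is needed
        yield scrapy.Request(response.urljoin(link), callback=self.parse_spider2)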