Unless I set CONCURRENT_REQUESTS_PER_DOMAIN = 1, I end up with items whose attributes were scraped from pages other than the one the item belongs to.
I suspect this comes from the fact that I yield the requests "manually" in parse_chapter, but I'm not sure, and I'd like to understand how Scrapy behaves here.
Here is the relevant part of the code:
rules = (
    Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                           restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a',
                           process_value=process_links),
         callback='parse_chapter'),
)
def parse_chapter(self, response):
    item = TogItem()
    item['chaptertitle'] = response.xpath('.//*[@id="chapter_num"]/text()').extract()
    pages = int(response.xpath('.//*[@id="head"]/span/text()').extract()[0])
    for p in range(1, pages + 1):
        page_absolute_url = urlparse.urljoin(response.url, str(p) + '.html')
        print("page_absolute_url: {}".format(page_absolute_url))
        yield Request(page_absolute_url, meta={'item': item},
                      callback=self.parse_pages, dont_filter=True)
def parse_pages(self, response):
    item = response.request.meta['item']
    item['pagenumber'] = response.xpath('.//*[@id="chapter_page"]/text()').extract()
    print(item['pagenumber'])
    images = response.xpath('//*[@id="image"]/@src')
    images_absolute_url = []
    for ie in images:
        print("ie.extract(): {}".format(ie.extract()))
        images_absolute_url.append(urlparse.urljoin(response.url, ie.extract().strip()))
    print("images_absolute_url: {}".format(images_absolute_url))
    item['image_urls'] = images_absolute_url
    yield item
Answer (score: 3)
This happens because you are sending the same instance of the item for all the pages (the one you create with item = TogItem() in parse_chapter).
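To see why sharing one instance goes wrong, here is a minimal plain-Python sketch of the same pattern (no Scrapy involved, all names invented): every "request" carries a reference to the same object, so the last mutation wins.

item = {'chaptertitle': 'chapter 1'}          # stands in for the single TogItem()
carried = [item, item, item]                  # what meta={'item': item} does for 3 pages

for pagenumber, it in enumerate(carried, start=1):
    it['pagenumber'] = pagenumber             # each callback mutates the shared object

print([it['pagenumber'] for it in carried])   # [3, 3, 3] -- not [1, 2, 3]

This is presumably also why CONCURRENT_REQUESTS_PER_DOMAIN = 1 appeared to fix it: responses were handled strictly one after another, so each mutation happened to be consumed before the next one overwrote it.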
One way to fix this is to create the item inside the for loop:
def parse_chapter(self, response):
    chaptertitle = response.xpath('.//*[@id="chapter_num"]/text()').extract()
    pages = int(response.xpath('.//*[@id="head"]/span/text()').extract()[0])
    for p in range(1, pages + 1):
        item = TogItem(chaptertitle=chaptertitle)
        page_absolute_url = urlparse.urljoin(response.url, str(p) + '.html')
        yield Request(page_absolute_url, meta={'item': item},
                      callback=self.parse_pages, dont_filter=True)
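If you'd rather keep building the chapter-level fields once, an alternative with the same effect is to hand each request its own copy. A sketch, assuming TogItem is a regular scrapy.Item subclass (an item can be constructed from another item, which copies its fields):

import urlparse                # Python 2; use urllib.parse on Python 3
from scrapy.http import Request

def parse_chapter(self, response):
    base = TogItem()           # fill the shared fields once
    base['chaptertitle'] = response.xpath('.//*[@id="chapter_num"]/text()').extract()
    pages = int(response.xpath('.//*[@id="head"]/span/text()').extract()[0])
    for p in range(1, pages + 1):
        page_absolute_url = urlparse.urljoin(response.url, str(p) + '.html')
        # TogItem(base) builds a fresh item with base's fields copied in,
        # so each request mutates its own instance in parse_pages.
        yield Request(page_absolute_url, meta={'item': TogItem(base)},
                      callback=self.parse_pages, dont_filter=True)

Either way, parse_pages can stay exactly as it is, since each response now carries its own item.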