Scrapy mixes item fields from different pages

Asked: 2014-12-07 16:54:43

Tags: python scrapy

Unless I set CONCURRENT_REQUESTS_PER_DOMAIN = 1, I end up with items that mix in fields scraped from other pages.

I suspect this may come from the fact that I generate requests "manually" in parse_chapter, but I am not sure, and I would like to understand how Scrapy works here.

Here is the relevant part of the code:

    # (module-level imports assumed elsewhere: import urlparse; from scrapy.http import Request)
    rules = (
        Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                               restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a',
                               process_value=process_links),
             callback='parse_chapter'),
    )


    def parse_chapter(self, response):
        item = TogItem()
        item['chaptertitle'] = response.xpath('.//*[@id="chapter_num"]/text()').extract()

        pages = int(response.xpath('.//*[@id="head"]/span/text()').extract()[0])

        for p in range(1, pages + 1):
            page_absolute_url = urlparse.urljoin(response.url, str(p) + '.html')
            print("page_absolute_url: {}".format(page_absolute_url))
            # the "manually" generated request I suspect:
            yield Request(page_absolute_url, meta={'item': item}, callback=self.parse_pages, dont_filter=True)

    def parse_pages(self, response):
        item = response.request.meta['item']
        item['pagenumber'] = response.xpath('.//*[@id="chapter_page"]/text()').extract()
        print(item['pagenumber'])

        images = response.xpath('//*[@id="image"]/@src')
        images_absolute_url = []
        for ie in images:
            print("ie.extract(): {}".format(ie.extract()))
            images_absolute_url.append(urlparse.urljoin(response.url, ie.extract().strip()))

        print("images_absolute_url: {}".format(images_absolute_url))

        item['image_urls'] = images_absolute_url
        yield item

1 Answer:

Answer 0 (score: 3):

This happens because you send the same item instance with every page request: the one you create with item = TogItem() in parse_chapter. Scrapy processes those requests concurrently, so every parse_pages callback receives a reference to that single mutable object and they overwrite each other's fields; with CONCURRENT_REQUESTS_PER_DOMAIN = 1 the writes merely happen to be serialized, which is why the mixing disappears.
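The effect is easy to reproduce without Scrapy. Here is a minimal, standalone sketch (the names such as fake_parse_pages are illustrative, not from the spider) of several callbacks sharing one mutable dict:

    # Scrapy-free sketch of the bug: several "callbacks" share one mutable
    # dict, so every write is visible through every reference to it.
    item = {'chaptertitle': 'Chapter 1'}  # created once, like TogItem() above

    def fake_parse_pages(shared_item, pagenumber):
        shared_item['pagenumber'] = pagenumber  # mutates the one shared object
        return shared_item

    results = [fake_parse_pages(item, p) for p in range(1, 4)]
    print(results[0] is results[2])   # True: three references, one object
    print(results[0]['pagenumber'])   # 3: the last write wins everywhere

In the spider the writes are interleaved by concurrent responses instead of a loop, but the mechanism is the same.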

One way to fix this is to create the item inside the for loop:

    def parse_chapter(self, response):
        chaptertitle = response.xpath('.//*[@id="chapter_num"]/text()').extract()
        pages = int(response.xpath('.//*[@id="head"]/span/text()').extract()[0])

        for p in range(1, pages + 1):
            item = TogItem(chaptertitle=chaptertitle)  # a fresh item per page

            page_absolute_url = urlparse.urljoin(response.url, str(p) + '.html')

            yield Request(page_absolute_url, meta={'item': item},
                          callback=self.parse_pages, dont_filter=True)
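Alternatively, you can avoid sharing any mutable object at all by passing only the extracted value through meta and constructing the item in the page callback. A sketch of that variant, reusing the field names from the code above (this rewrite of parse_pages is an assumption, not taken from the question):

    # Assumed variant: send plain data through meta and build a fresh TogItem
    # per response. Imports as above: import urlparse; from scrapy.http import Request.
    def parse_chapter(self, response):
        chaptertitle = response.xpath('.//*[@id="chapter_num"]/text()').extract()
        pages = int(response.xpath('.//*[@id="head"]/span/text()').extract()[0])
        for p in range(1, pages + 1):
            page_absolute_url = urlparse.urljoin(response.url, str(p) + '.html')
            yield Request(page_absolute_url,
                          meta={'chaptertitle': chaptertitle},  # plain data only
                          callback=self.parse_pages, dont_filter=True)

    def parse_pages(self, response):
        # Each response builds its own item, so concurrent pages cannot clash.
        item = TogItem(chaptertitle=response.meta['chaptertitle'])
        item['pagenumber'] = response.xpath('.//*[@id="chapter_page"]/text()').extract()
        images = response.xpath('//*[@id="image"]/@src').extract()
        item['image_urls'] = [urlparse.urljoin(response.url, src.strip()) for src in images]
        yield item

Either way, the fix is the same in spirit: nothing mutable should be shared between requests that can be in flight at the same time.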