Scrapy: returning multiple items

Date: 2017-10-04 18:14:26

Tags: python web-scraping scrapy scrapy-spider

I'm new to Scrapy and I'm simply lost on how to return multiple items from one block.

Basically, I'm getting one HTML tag whose nested tags contain the quote text, the author's name, and some tags about that quote.

The code here only ever returns a single quote, and that's it. It doesn't use the loop to return the rest. I've been searching the web for hours and I just don't get it. Here is my code so far:

Spider.py

import scrapy
from scrapy.loader import ItemLoader
from first_spider.items import FirstSpiderItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        l = ItemLoader(item=FirstSpiderItem(), response=response)

        quotes = response.xpath("//*[@class='quote']")

        for quote in quotes:
            text = quote.xpath(".//span[@class='text']/text()").extract_first()
            author = quote.xpath(".//small[@class='author']/text()").extract_first()
            tags = quote.xpath(".//meta[@class='keywords']/@content").extract_first()

            # removes quotation marks from the text
            for c in ['“', '”']:
                if c in text:
                    text = text.replace(c, "")

            l.add_value('text', text)
            l.add_value('author', author)
            l.add_value('tags', tags)
            return l.load_item()

        next_page_path = response.xpath(".//li[@class='next']/a/@href").extract_first()
        next_page_url = response.urljoin(next_page_path)
        yield scrapy.Request(next_page_url)

Items.py

import scrapy

class FirstSpiderItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Here is the page I'm trying to scrape:

Link

2 Answers:

Answer 0 (score: 4)

I was looking for a solution to the same problem myself. Here is the solution I found:

def parse(self, response):
    for selector in response.xpath("//*[@class='quote']"):
        l = ItemLoader(item=FirstSpiderItem(), selector=selector)
        l.add_xpath('text', './/span[@class="text"]/text()')
        l.add_xpath('author', './/small[@class="author"]/text()')
        l.add_xpath('tags', './/meta[@class="keywords"]/@content')
        yield l.load_item()

    next_page = response.xpath(".//li[@class='next']/a/@href").extract_first()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
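The crucial change from the question's code is yielding inside the loop instead of returning: `return` hands back one value and exits `parse` immediately, while `yield` turns `parse` into a generator that emits an item for every quote. A minimal, Scrapy-free sketch of the difference (the helper names here are just for illustration):

```python
def take_first(values):
    # return exits the function on the first iteration
    for v in values:
        return v

def take_all(values):
    # yield keeps the loop running, emitting every value
    for v in values:
        yield v

print(take_first([1, 2, 3]))       # 1
print(list(take_all([1, 2, 3])))   # [1, 2, 3]
```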

To remove the quotation marks from the text, you can use an output processor in items.py.

from scrapy.loader.processors import MapCompose

def replace_quotes(text):
    for c in ['“', '”']:
        if c in text:
            text = text.replace(c, "")
    return text

class FirstSpiderItem(scrapy.Item):
    text = scrapy.Field(output_processor=MapCompose(replace_quotes))
    author = scrapy.Field()
    tags = scrapy.Field()
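Roughly speaking, MapCompose applies each given function to every value collected for the field and drops values that become `None`. A simplified, Scrapy-free sketch of that idea (`map_compose` below is a hypothetical stand-in, not Scrapy's actual class, which also flattens iterable results):

```python
def map_compose(*functions):
    # Simplified stand-in for Scrapy's MapCompose: apply each
    # function to every value, dropping values that become None.
    def process(values):
        for fn in functions:
            values = [fn(v) for v in values]
            values = [v for v in values if v is not None]
        return values
    return process

strip_quotes = map_compose(lambda t: t.replace('“', '').replace('”', ''))
print(strip_quotes(['“Hello”', 'World']))  # ['Hello', 'World']
```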

Let me know if this helps.

Answer 1 (score: 1)

Give this a try. It will get you all the data you want to scrape.

import scrapy

class QuotesSpider(scrapy.Spider):

    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.xpath("//*[@class='quote']"):
            text = quote.xpath(".//span[@class='text']/text()").extract_first()
            author = quote.xpath(".//small[@class='author']/text()").extract_first()
            tags = quote.xpath(".//meta[@class='keywords']/@content").extract_first()
            yield {"Text":text,"Author":author,"Tags":tags}

        next_page = response.xpath(".//li[@class='next']/a/@href").extract_first()
        if next_page:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url)