Question

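The code I wrote from examples I found by searching did not seem to work as intended, so I decided to start from a working model I found on GitHub: https://github.com/scrapy/quotesbot/blob/master/quotesbot/spiders/toscrape-xpath.py. I then modified it slightly to show the problem I am running into. The code below works as expected, but my ultimate goal is to pass the scraped data from the first "parse" to a second "parse2" function so that I can combine data from 2 different pages. For now I want to start very simply so I can follow what is happening, which is why the code below is so heavily stripped down.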

# -*- coding: utf-8 -*-
import scrapy
from quotesbot.items import MyItems
from scrapy import Request


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        item = MyItems()
        for quote in response.xpath('//div[@class="quote"]'):
            item['tinfo'] = quote.xpath('./span[@class="text"]/text()').extract_first()
            yield item



But when I modify the code as below:

# -*- coding: utf-8 -*-
import scrapy
from quotesbot.items import MyItems
from scrapy import Request


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        item = MyItems()
        for quote in response.xpath('//div[@class="quote"]'):
            item['tinfo'] = quote.xpath('./span[@class="text"]/text()').extract_first()
            yield Request("http://quotes.toscrape.com/",
                          callback=self.parse2, meta={'item': item})

    def parse2(self, response):
        item = response.meta['item']
        yield item

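I only get one item scraped, and Scrapy reports the rest as duplicates. It also looks as if "parse2" is never even reached. I have played with the indentation and the brackets, thinking I was missing something simple, but without much success. I have looked at many examples to work out what the problem might be, but I still cannot get it to work. I am sure this is a very simple issue for the gurus out there, so I am yelling "Help!" to anyone who can.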

My items.py file is shown below as well. As far as I can tell, items.py and toscrape-xpath.py are the only two files in play, but I am quite new to all of this.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuotesbotItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class MyItems(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    tinfo = scrapy.Field()
    pass

Thank you very much for any help you can provide.

# -*- coding: utf-8 -*-
import scrapy
from quotesbot.items import MyItems
from scrapy import Request


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        item = MyItems()
        for quote in response.xpath('//div[@class="quote"]'):
            item = {'tinfo': quote.xpath('./span[@class="text"]/text()').extract_first()}
        yield response.follow('http://quotes.toscrape.com', self.parse_2,
                              meta={'item': item})

    def parse_2(self, response):
        print("almost there")
        item = response.meta['item']
        yield item

Answer 1

Your spider logic is very confusing:

def parse(self, response):
    for quote in response.xpath('//div[@class="quote"]'):
        yield Request("http://quotes.toscrape.com/",
                      callback=self.parse2, meta={'item': item})

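For every quote you find on quotes.toscrape.com, you schedule another request to the same web page? What happens is that these newly scheduled requests get filtered out by Scrapy's duplicate request filter.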
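To make the filtering concrete, here is a toy model of a duplicate filter in plain Python. It is a simplified sketch, not Scrapy's actual implementation: the class `SeenUrlFilter` and the helper `schedule` are names invented for this illustration, and it assumes the page holds ten quotes. The only real Scrapy name borrowed here is the `dont_filter` flag, which `Request` accepts to bypass the filter.

```python
class SeenUrlFilter:
    """Toy duplicate filter: remembers every URL it has let through."""

    def __init__(self):
        self.seen = set()

    def allow(self, url, dont_filter=False):
        # Requests flagged dont_filter bypass the check entirely.
        if dont_filter:
            return True
        if url in self.seen:
            return False
        self.seen.add(url)
        return True


def schedule(urls, dupefilter, dont_filter=False):
    """Return only the URLs that would actually be downloaded."""
    return [u for u in urls if dupefilter.allow(u, dont_filter=dont_filter)]


# Assumed: ten quotes on the page, each one scheduling the same URL again.
same_page = ["http://quotes.toscrape.com/"] * 10

filtered = schedule(same_page, SeenUrlFilter())
unfiltered = schedule(same_page, SeenUrlFilter(), dont_filter=True)

print(len(filtered))    # 1
print(len(unfiltered))  # 10
```

Only the first request to the repeated URL gets through, which matches the symptom in the question: one item scraped and the rest reported as duplicates.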

Maybe you should just yield the item right there:

def parse(self, response):
    for quote in response.xpath('//div[@class="quote"]'):
        item = MyItems()
        item['tinfo'] = quote.xpath('./span[@class="text"]/text()').extract_first()
        yield item

To illustrate why your current crawler does nothing, see this diagram:
