Question

I am trying to scrape "Brick Bank" from the tag below:

<a href="/sets/10251-1/Brick-Bank"><span>10251: </span> Brick Bank</a>

Here is my Scrapy spider:

import scrapy

class SpiderSpider(scrapy.Spider):  # subclass of the Spider class provided by Scrapy
    name = 'spider'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['brickset.com']
    start_urls = ['http://brickset.com/sets/year-2016/']

    def parse(self, response):
        # Each set on the page sits in an element with the CSS class "set"
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
            }

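But I am getting 'name': '10251: ' (as shown below) rather than Brick Bank. I am very new to this, so I am not sure why. I am following a tutorial that returns the correct name.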

 2018-08-20 19:56:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://brickset.com/sets/year-2016/>
{'name': '10251: '}

Answer 1

This is a common mistake with CSS selectors: whitespace changes what the selector actually matches.

'h1 a ::text'

This selects text from the elements nested inside the a tag, and the first match is the span holding '10251: '. What you need is:

'h1 a::text'

This selects only the text nodes that belong directly to the a tag itself.

Scrapy: scraping a child element, CSS selector grabs the wrong part
