I am trying to scrape "Brick Bank" from the tag below:
<a href="/sets/10251-1/Brick-Bank"><span>10251: </span> Brick Bank</a>
Here is my Scrapy spider:

    import scrapy


    class SpiderSpider(scrapy.Spider):
        # Subclass the Spider class provided by Scrapy.
        name = 'spider'
        # allowed_domains expects bare domain names, not full URLs.
        allowed_domains = ['brickset.com']
        start_urls = ['http://brickset.com/sets/year-2016/']

        def parse(self, response):
            # Each set on the page sits in an element with the CSS class "set".
            SET_SELECTOR = '.set'
            for brickset in response.css(SET_SELECTOR):
                NAME_SELECTOR = 'h1 a ::text'
                yield {
                    'name': brickset.css(NAME_SELECTOR).extract_first(),
                }
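But I am getting 'name': '10251: ' (as shown below) instead of Brick Bank. I am very new to this, so I am not sure why. I am following a tutorial that returns the correct name.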
2018-08-20 19:56:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://brickset.com/sets/year-2016/>
{'name': '10251: '}
Answer 0 (score: 2)
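This is a common mistake with CSS selectors: whitespace changes what the selector actually matches.

    'h1 a ::text'

With the space, this selects the text of the elements nested inside a (it is equivalent to 'h1 a *::text'), which here is the <span> holding "10251: ". What you need is:

    'h1 a::text'

Without the space, this selects only the text nodes that belong directly to the a tag, which here is " Brick Bank".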