I am trying to scrape the title, price, and upvote/downvote counts from the Supreme Community website using the Scrapy library in Python.
import scrapy


class SupremeSpider(scrapy.Spider):
    name = "Supreme"
    start_urls = [
        'https://www.supremecommunity.com/season/spring-summer2019/droplist/2019-02-25/'
    ]

    def parse(self, response):
        for data in response.css('div.card-details'):
            yield {
                'title': data.xpath("//h2/text()").getall(),
                'price': data.css('span.label-price::text').get()
                #'upvotes': data.xpath("//p/text()").getall()
                #'downvotes': quote.css('div.tags a.tag::text').getall(),
            }
When I run scrapy crawl Supreme in CMD, the result is as follows:
2019-02-27 14:19:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.supremecommunity.com/season/spring-summer2019/droplist/2019-02-25/>
{'title': ['Airbrushed Floral Skateboard', 'Formula Crewneck',
'Supreme®/MasterLock® Numeric Combination Lock', 'Supreme®/SIGG™ CYD 1.0L Water Bottle',
'Waist Bag', 'Creeper Tee', 'Smash Tee', 'FREE GIFT Shower Cap',
'Christopher Walken King Of New York Tee', 'Dish Towels (Set of 3)',
'Metal Lighter Holster', 'Bonded Logo Puffy Jacket', 'Shoulder Bag',
'Chenille Hooded Sweatshirt', 'Backpack', 'Overdyed Beanie', 'Fruit Tee', 'Knot Tee',
'Organizer Pouch', 'Supreme®/Hanes® Leopard Boxer Briefs (2 Pack)', 'Duffle Bag',
'Real Shit L/S Tee', 'Red Rum Baseball Jersey', 'Supreme®/Hanes® Boxer Briefs (4 Pack)',
'Kids Tee', 'Toy Uzi Inflatable Pillow', 'Apple Hooded Sweatshirt', 'Spotlight Keychain',
'Supreme®/Hanes® Crew Socks (4 Pack)', 'Taped Seam Jacket', 'Front Tee', 'Fruit Skateboard',
'Hard Goods Tee', 'Leda And The Swan Tee', 'Military Camp Cap', 'Leather Varsity Jacket',
'Patchwork Harrington Jacket', 'Formula Sweatpant', 'Supreme®/Hanes® Tagless Tees (3 Pack)',
'I Make Shit Happen Pin', 'Leda And The Swan Skateboard', 'Original Sin Tee',
'Clouds L/S Top', 'Racing Logo Work Shirt', 'Silk Camo Shirt', 'Statue of Liberty Pendant',
'Lust Ceramic Box', 'Piping Jacket', 'Patchwork Mohair Cardigan',
'Supreme®/Hanes® Leopard Tagless Tees (2 Pack)', 'Logo Hooded Pullover Sweatshirt',
'Supreme®/Spitfire® Classic Wheels (Set of 4)', 'Middle Finger To The World Tee',
'S/S Pocket Tee', 'Supreme®/Independent® Truck', 'GORE-TEX S-Logo 6-Panel',
'Tag Logo Sweater', 'Tech L/S Tee', 'Shears Hooded Sweatshirt', 'Patchwork Cargo Pant',
'Stone Washed Slim Jean', 'Text Stripe New Era®', 'Fuzzy Pile Trucker Jacket',
'D-Ring Trench Coat', 'Multi Stripe S/S Top', 'Piping Pant', 'Work Pant', 'Tag Logo Beanie',
'Corduroy Compact Logo 6-Panel', 'Oxford Shirt', 'Set In Logo Sweatpant',
'Washed Black Slim Jean', 'Rose Buffalo Plaid Shirt', 'Patchwork Bell Hat',
'Paisley Stripe L/S Top', 'Fuzzy Pile Short', 'Tie Dye Ripstop Camp Cap', 'Taped Seam Pant',
'Washed Regular Jean', 'Rigid Slim Jean', 'World 5-Panel', 'Signature Script Logo Camp Cap',
'Motherfucker 6-Panel'], 'price': '\n $48 / £46\n '}
I am trying to get the output to look like this:
{title: Airbrushed Floral Skateboard, price: $48 / £46, upvotes: 14218, downvotes: 1034}
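Answer 0 (score: 1)
When you use nested selectors, you need a proper relative XPath (one that starts with a dot); otherwise the expression is evaluated against the entire response rather than the current element: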
'title': data.xpath(".//h2/text()").get(),
See the docs: https://docs.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths