How do I extract nested text in Scrapy?

Asked: 2017-08-29 00:16:49

Tags: python scrapy

I am trying to use Scrapy to extract a brand description from this page: http://us.asos.com/hope-and-ivy/hope-ivy-dotty-mesh-midi-dress-with-ruffle-detail/prd/8663409?clr=black&cid=2623&pgesize=36&pge=0&totalstyles=627&gridsize=3&gridrow=1&gridcolumn=1

The HTML element looks like this:

<div class="brand-description">
  <h4>Brand</h4>
  <span>"Prom queens and wedding guests, claim the best-dressed title in "
    <a href="/Women/A-To-Z-Of-Brands/Hope-And-Ivy/Cat/pgecategory.aspx?cid=21368">
      <strong>"Hope and Ivy's"</strong>
    </a> 
    "occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses."
  </span>
</div>

The result I want is:

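"Prom queens and wedding guests, claim the best-dressed title in Hope and Ivy's occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses."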

I tried this:

response.css("div.brand-description span::text").extract()

However, the list of text I get back is missing the text inside the "strong" tag, i.e. "Hope and Ivy's":

['Prom queens and wedding guests, claim the best-dressed title in ',  ' occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses.']

My question is: can I get the plain text without having to care about the "href" tag?

1 answer:

Answer 0: (score: 2)

You may still need to do some post-processing, but this is probably the best you can do:

response.xpath('normalize-space(//div[@class="brand-description"]/span)').extract_first()

which gives you

u'"Prom queens and wedding guests, claim the best-dressed title in " "Hope and Ivy\'s" "occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses."'
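The post-processing mentioned above could look something like the sketch below. It is only one possible approach, not part of the original answer: the helper name `clean_description` is made up for illustration, and it assumes the description never contains a double quote that should be kept, so every `"` is simply dropped before the whitespace is collapsed.

```python
def clean_description(raw):
    """Drop literal double quotes and collapse runs of whitespace."""
    # Assumption: no double quote in the description is meaningful,
    # so all of them can be removed.
    without_quotes = raw.replace('"', '')
    # str.split() with no arguments splits on any run of whitespace,
    # so re-joining with one space also trims both ends.
    return ' '.join(without_quotes.split())


# The string returned by the normalize-space() XPath shown above.
raw = (
    '"Prom queens and wedding guests, claim the best-dressed title in " '
    '"Hope and Ivy\'s" '
    '"occasion-ready collection. Shop its notice-me styles for hand-painted '
    'florals, Bardot necklines and figure-flattering pencil dresses."'
)

print(clean_description(raw))
```

In a spider, `raw` would be the value returned by the `extract_first()` call rather than a hard-coded string. If the real text can contain quotes worth keeping, a narrower rule (for example, stripping only the `" "` separators) would be safer.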