Scrapy: get the value of href if a condition is true

Time: 2017-05-06 16:55:36

Tags: python html css scrapy

I scraped a page with this HTML content:



<div class="td-ss-main-content">
  <div class="td-page-header">...</div>
  <div class="td_module_16 td_module_wrap td-animation-stack">...</div>
  <div class="td_module_16 td_module_wrap td-animation-stack td_module_no_thumb">...</div>
  <div class="page-nav td-pb-padding-side">
    <span class="current">1</span>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/2/" class="page" title="2">2</a>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/3/" class="page" title="3">3</a>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/2/"><i class="td-icon-menu-right"></i></a>
    <span class="pages">Page 1 of 3</span>
  </div>
</div>

Now I want to get the next-page link, if it is present. That link is the href value of the .page-nav > a element that contains an i tag.

I can do this:

response.css("div.page-nav > a")[2].css("::attr(href)").extract_first()

But this won't work if I'm on page 2. So it would be better to get the href value of an a tag only if that a tag has an i tag as a child element. How can I do this?

Update (page 2)

<div class="page-nav td-pb-padding-side">
    <a href="http://www.arunachaltimes.in/2017/05/06/"><i class="td-icon-menu-left"></i></a>
    <a href="http://www.arunachaltimes.in/2017/05/06/" class="page" title="1">1</a>
    <span class="current">2</span>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/3/" class="page" title="3">3</a>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/3/"><i class="td-icon-menu-right"></i></a>
    <span class="pages">Page 2 of 3</span>
</div>

Update (page 3, the last page)

<div class="page-nav td-pb-padding-side">
    <a href="http://www.arunachaltimes.in/2017/05/06/page/2/"><i class="td-icon-menu-left"></i></a>
    <a href="http://www.arunachaltimes.in/2017/05/06/" class="page" title="1">1</a>
    <a href="http://www.arunachaltimes.in/2017/05/06/page/2/" class="page" title="2">2</a>
    <span class="current">3</span>
    <span class="pages">Page 3 of 3</span>
</div>

1 answer:

Answer 0: (score: 2)

You can do this with an XPath expression:

//div[contains(concat(' ', @class, ' '), ' page-nav ')]/a[contains(concat(' ', i/@class, ' '), ' td-icon-menu-right ')]/@href

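Note that, to avoid false positives, we use concat for the class attribute check: padding the class value with spaces makes contains match a whole class name rather than any substring of it.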

Demo:

$ scrapy shell file:////$PWD/index.html
In [1]: response.xpath("//div[contains(concat(' ', @class, ' '), ' page-nav ')]/a[contains(concat(' ', i/@class, ' '), ' td-icon-menu-right ')]/@href").extract_first()
Out[1]: u'http://www.arunachaltimes.in/2017/05/06/page/2/'