Matching HTML output in Scrapy (skipping the first match)

Time: 2017-01-09 14:09:13

Tags: python web-scraping scrapy

I already have working Scrapy code, but I am having trouble formulating NEXT_PAGE_SELECTOR. I select elements in Scrapy with CSS selectors:

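def parse(self, response):
    """Parse one page of search results."""
    # Get the first page of results.
    # Assumption: each result is an element with class "b_algo", so the
    # selector needs a leading dot (a bare 'b_algo' would match <b_algo> tags).
    SET_SELECTOR = '.b_algo'
    NAME_SELECTOR = 'h2 a ::text'
    for bresult in response.css(SET_SELECTOR):
        yield {
            'name': bresult.css(NAME_SELECTOR).extract_first(),
        }

    # Get the further pages of results.
    #<<NEXT_PAGE_SELECTOR here>>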

The HTML I am trying to match is:

<ul class="sb_pagF" aria-label="More pages with results">
<li>
          <a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&amp;first=11&amp;FORM=PORE">
            <div class="sw_next">Next
            </div>
          </a>
</li>
</ul>

I have come up with the following to match it:

NEXT_PAGE_SELECTOR = '.sb_pagF li a ::attr(href)'

Is this the correct way to grab the href?

Thanks!

2 answers:

Answer 0: (score: 3)

You can always test your selectors in the Scrapy shell by pointing it at a local HTML file:

$ cat index.html
<ul class="sb_pagF" aria-label="More pages with results">
    <li>
        <a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&amp;first=11&amp;FORM=PORE">
            <div class="sw_next">Next
            </div>
        </a>
    </li>
</ul>
$ scrapy shell file://$PWD/index.html
In [1]: response.css(".sb_pagF li a ::attr(href)").extract_first()
Out[1]: u'/search?q=site%3asite.com&first=11&FORM=PORE'
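Note that the extracted href is relative, so it has to be resolved against the page URL before it can be requested. Inside a spider, `response.urljoin(href)` does this against the response's own URL. The same resolution can be sketched with the standard library alone; the `page_url` below is a hypothetical placeholder, not the real site:

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the link was scraped from.
page_url = "https://www.example.com/search?q=site%3asite.com"

# The relative href returned by the selector above.
next_href = "/search?q=site%3asite.com&first=11&FORM=PORE"

# A leading "/" replaces the path and query of the base URL,
# while the scheme and host are kept.
next_url = urljoin(page_url, next_href)
print(next_url)
```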

Answer 1: (score: 3)

Yes, that is correct:

$ scrapy shell
In [1]: foo = """<ul class="sb_pagF" aria-label="More pages with results">
<li>
          <a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&amp;first=11&amp;FORM=PORE">
            <div class="sw_next">Next
            </div>
          </a>
</li>
</ul>"""
In [2]: from scrapy import Selector
In [3]: sel = Selector(text=foo)
In [4]: sel.css('.sb_pagF li a ::attr(href)').extract()
Out[4]: [u'/search?q=site%3asite.com&first=11&FORM=PORE']