I already have working Scrapy code, but I'm having trouble formulating NEXT_PAGE_SELECTOR. I select elements in Scrapy with CSS selectors:
def parse(self, response):
    '''
    get the first page of results.
    '''
    # 'b_algo' is a class name, so the CSS selector needs a leading dot
    SET_SELECTOR = '.b_algo'
    for bresult in response.css(SET_SELECTOR):
        NAME_SELECTOR = 'h2 a ::text'
        yield {
            'name': bresult.css(NAME_SELECTOR).extract_first(),
        }
    # get the further pages of results.
    #<<NEXT_PAGE_SELECTOR here>>
The HTML I'm trying to match is:
<ul class="sb_pagF" aria-label="More pages with results">
  <li>
    <a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&first=11&FORM=PORE">
      <div class="sw_next">Next
      </div>
    </a>
  </li>
</ul>
I came up with the following to match it:
NEXT_PAGE_SELECTOR = '.sb_pagF li a ::attr(href)'
Does this correctly grab the href?
Thanks!
Answer 0 (score: 3)
You can always test your selectors in the Scrapy shell by pointing it at a local HTML file:
$ cat index.html
<ul class="sb_pagF" aria-label="More pages with results">
<li>
<a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&first=11&FORM=PORE">
<div class="sw_next">Next
</div>
</a>
</li>
</ul>
$ scrapy shell file://$PWD/index.html
In [1]: response.css(".sb_pagF li a ::attr(href)").extract_first()
Out[1]: u'/search?q=site%3asite.com&first=11&FORM=PORE'
Answer 1 (score: 3)
Yes, that is correct:
$ scrapy shell
In [1]: foo = """<ul class="sb_pagF" aria-label="More pages with results">
<li>
<a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&first=11&FORM=PORE">
<div class="sw_next">Next
</div>
</a>
</li>
</ul>"""
In [2]: from scrapy import Selector
In [3]: sel = Selector(text=foo)
In [4]: sel.css('.sb_pagF li a ::attr(href)').extract()
Out[4]: [u'/search?q=site%3asite.com&first=11&FORM=PORE']