I have to write a scraper, and I don't understand why it isn't working...
The site has pagination like this:
<div class="pagination toolbarbloc">
<ul>
<li class="active"><span>1</span></li>
<li><a href="...">2</a></li>
<li><a href="...">3</a></li>
<li><a href="...">4</a></li>
<li><a href="...">5</a></li>
<li><a class="end" href="...">>></a></li>
</ul>
</div>
The "active" class moves as you go to the next page, so on page 5 it sits on the "li" tag just before the last one. I grab the item that follows the "li" tag with class "active" like this:
next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li[@class="active"]/following-sibling::li/a/@href'
It works perfectly for the first 5 pages... but it does not work for page 6, where it should grab the "a" tag with class "end"...
I tried:
try:
    next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li[@class="active"]/following-sibling::li/a/@href'
    next_page_url = begin + response.xpath(next_page_url_xpath)[0].extract()
except (ValueError, IndexError):
    next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li/a[@class="end"]/@href'
    next_page_url = begin + response.xpath(next_page_url_xpath)[0].extract()
Does anyone have an idea? :) Thanks for your help!
Answer 0 (score: 0)
from lxml import etree

# Sample pagination markup, with placeholder hrefs so the output is readable
test_xml = """<div class="pagination toolbarbloc">
<ul>
<li class="active"><span>1</span></li>
<li><a href="1href">2</a></li>
<li><a href="2href">3</a></li>
<li><a href="3href">4</a></li>
<li><a href="4href">5</a></li>
<li><a class="end" href="5href">>></a></li>
</ul>
</div>"""

# Parse with lxml's lenient HTML parser
tree = etree.HTML(test_xml)
# Collect the href of every link in the pagination block
rep = tree.xpath('//div[@class="pagination toolbarbloc"]/ul/li/a/@href')
print(rep)
# ['1href', '2href', '3href', '4href', '5href']
I'm not sure I fully understood what you are asking. If a Python snippet like this is what you are after, maybe it will help.
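If the goal is only the link of the page right after the active one, one possible approach is to add a positional predicate, "following-sibling::li[1]", so the XPath returns just the first "li" after the active one instead of every following sibling. The sketch below is not tested against the real site: the function name next_page_url, the sample markup, the "page6" href and the example.com base URL are all made up for illustration, and urljoin is used here in place of the "begin + ..." string concatenation from the question.

```python
from urllib.parse import urljoin

from lxml import etree

# [1] keeps only the first <li> that follows the active one
NEXT_PAGE_XPATH = (
    '//div[@class="pagination toolbarbloc"]/ul'
    '/li[@class="active"]/following-sibling::li[1]/a/@href'
)


def next_page_url(html, base_url):
    """Return the absolute URL of the next page, or None if there is none."""
    tree = etree.HTML(html)
    hrefs = tree.xpath(NEXT_PAGE_XPATH)
    if not hrefs:
        # The active <li> has no following sibling with a link
        return None
    return urljoin(base_url, hrefs[0])


# Hypothetical markup for page 5: the active item sits just before the "end" link
PAGE_5 = """<div class="pagination toolbarbloc">
<ul>
<li><a href="page4">4</a></li>
<li class="active"><span>5</span></li>
<li><a class="end" href="page6">&gt;&gt;</a></li>
</ul>
</div>"""

# Hypothetical markup for the last page: nothing follows the active item
LAST_PAGE = """<div class="pagination toolbarbloc">
<ul>
<li><a href="page5">5</a></li>
<li class="active"><span>6</span></li>
</ul>
</div>"""

print(next_page_url(PAGE_5, "http://example.com/list/"))
# http://example.com/list/page6
print(next_page_url(LAST_PAGE, "http://example.com/list/"))
# None
```

Returning None instead of indexing with [0] avoids the IndexError that the try/except in the question has to catch, so the caller can simply stop following links when there is no next page.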