Pagination level 2 - scrapy python

Date: 2017-09-26 00:48:29

Tags: python xpath scrapy

I have to build a scraper, and I don't understand why it isn't working...

The site has pagination like this:

<div class="pagination toolbarbloc">
        <ul>
                <li class="active"><span>1</span></li>
                <li><a href="...">2</a></li>
                <li><a href="...">3</a></li>
                <li><a href="...">4</a></li>
                <li><a href="...">5</a></li>
                <li><a class="end" href="...">&gt;&gt;</a></li>
        </ul>
</div>

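The "active" class moves as you go to the next page, so on page 5 it is the `li` tag just before the last one that carries the "active" class. I grab the item that follows the `li` tag with the "active" class like this: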

next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li[@class="active"]/following-sibling::li/a/@href'

It works perfectly for the first 5 pages, but on page 6 it fails to pick up the `a` tag with the "end" class.

I tried:

    try:
        next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li[@class="active"]/following-sibling::li/a/@href'
        next_page_url = begin + response.xpath(next_page_url_xpath)[0].extract()
    except (ValueError,IndexError):
        next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li/a[@class="end"]/@href'
        next_page_url = begin + response.xpath(next_page_url_xpath)[0].extract()
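
A side note on the `begin + ...` concatenation above: a minimal sketch of building the absolute URL with the standard library's `urljoin` instead, which handles relative paths, absolute paths, and full URLs alike. The helper name `absolute_url`, the example.com base URL, and the hrefs are hypothetical placeholders, since the real site is not shown here. Inside a Scrapy callback, `response.urljoin(href)` does the same thing using the response's own URL as the base.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical values: the real site URL and hrefs are not given in the question.
page_url = "https://example.com/catalog/list?page=1"

print(absolute_url(page_url, "list?page=2"))           # relative to the current path
print(absolute_url(page_url, "/catalog/list?page=2"))  # absolute path on the same host
```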

Does anyone have an idea? :) Thanks for your help!

1 answer:

Answer 0: (score: 0)

from lxml import etree

# Sample markup copied from the question.
test_xml = """<div class="pagination toolbarbloc">
        <ul>
                <li class="active"><span>1</span></li>
                <li><a href="1href">2</a></li>
                <li><a href="2href">3</a></li>
                <li><a href="3href">4</a></li>
                <li><a href="4href">5</a></li>
                <li><a class="end" href="5href">&gt;&gt;</a></li>
        </ul>
</div>"""

# Parse with lxml's lenient HTML parser.
tree = etree.HTML(test_xml)
# Collect the href of every link in the pagination list.
rep = tree.xpath('//div[@class="pagination toolbarbloc"]/ul/li/a/@href')

print(rep)
# ['1href', '2href', '3href', '4href', '5href']

I'm not sure I fully understood what you are asking. If you really want a Python function like this, maybe this can help you.
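
If the goal is only the next page rather than every link, a minimal sketch along the same lines: take the first `li` after the active one (`following-sibling::li[1]`), and fall back to the `a` with the "end" class when nothing follows. The helper name `next_page_href` and the sample `href` values are hypothetical, and this assumes the page markup matches the snippet in the question; the real page 6 may be structured differently.

```python
from lxml import etree

# Link inside the first <li> that follows the active one.
NEXT_XPATH = ('//div[@class="pagination toolbarbloc"]/ul'
              '/li[@class="active"]/following-sibling::li[1]/a/@href')
# Fallback: the ">>" link carrying class "end".
END_XPATH = '//div[@class="pagination toolbarbloc"]/ul/li/a[@class="end"]/@href'


def next_page_href(html):
    """Return the next-page href, or None if no candidate link exists."""
    tree = etree.HTML(html)
    for xpath in (NEXT_XPATH, END_XPATH):
        hrefs = tree.xpath(xpath)
        if hrefs:
            return hrefs[0]
    return None


# Shortened, hypothetical version of the question's markup.
sample = """<div class="pagination toolbarbloc">
  <ul>
    <li class="active"><span>1</span></li>
    <li><a href="1href">2</a></li>
    <li><a class="end" href="5href">&gt;&gt;</a></li>
  </ul>
</div>"""

print(next_page_href(sample))
```

Returning `None` instead of indexing with `[0]` avoids the `IndexError` handling from the question; the caller can simply stop paginating when no href comes back.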