Remove the first HTML tag using Python and Scrapy

Time: 2015-06-05 08:05:41

Tags: python xpath scrapy scrapy-spider

I have this HTML:

<div class="abc">
    <div class="xyz">
        <div class="needremove"></div>
        <p>text</p>
        <p>text</p>
        <p>text</p>
        <p>text</p>
    </div>
</div>

I used: response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]').extract()

Result:

[u'<div class="xyz">
        <div class="needremove"></div>
        <p>text</p>
        <p>text</p>
        <p>text</p>
        <p>text</p>
    </div>']

I want to remove <div class="needremove"></div>. Can you help me?

1 Answer:

Answer 0 (score: 1)

You can get all the child tags except the div with class="needremove":

response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract()

Demo from the shell:

$ scrapy shell index.html
In [1]: response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract()
Out[1]: [u'<p>text</p>', u'<p>text</p>', u'<p>text</p>', u'<p>text</p>']
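
The XPath above returns the remaining children as separate strings. If what you want is the original <div class="xyz"> wrapper with only the unwanted child taken out, one option is to drop the node from the tree and serialize the parent again. The following is a minimal sketch of that idea, not part of the original answer. It uses the standard library's xml.etree.ElementTree instead of Scrapy, which only works here because the fragment happens to be well-formed XML. The helper name strip_children_by_class is hypothetical, and it matches whole class tokens, whereas contains(@class, ...) in the XPath matches substrings.

```python
import xml.etree.ElementTree as ET

# The fragment from the question. It is well-formed XML, which is the
# assumption that lets the standard-library parser read it.
HTML = """
<div class="abc">
    <div class="xyz">
        <div class="needremove"></div>
        <p>text</p>
        <p>text</p>
        <p>text</p>
        <p>text</p>
    </div>
</div>
"""


def strip_children_by_class(parent, class_name):
    """Remove direct children of `parent` that carry `class_name` as a class token.

    Hypothetical helper for illustration only; not a Scrapy API.
    """
    # Iterate over a copy, because we mutate `parent` inside the loop.
    for child in list(parent):
        if class_name in child.get("class", "").split():
            parent.remove(child)


root = ET.fromstring(HTML.strip())          # root is <div class="abc">
xyz = root.find("./div[@class='xyz']")      # the wrapper we want to keep
strip_children_by_class(xyz, "needremove")  # drop the unwanted child in place
cleaned = ET.tostring(xyz, encoding="unicode")
print(cleaned)
```

For real-world pages that are not well-formed XML, the same remove-then-serialize approach would need an HTML-tolerant parser, so treat this as an illustration of the idea rather than a drop-in replacement for the XPath answer.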