Question

I have this HTML:

<div class="abc">
            <div class="xyz">
                <div class="needremove"></div>
                <p>text</p>
                <p>text</p>
                <p>text</p>
                <p>text</p>
            </div>
    </div>

I used: response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]').extract()

Result:

[u'<div class="xyz">
        <div class="needremove"></div>
        <p>text</p>
        <p>text</p>
        <p>text</p>
        <p>text</p>
    </div>']

I want to remove <div class="needremove"></div>. Can you help me?

Answer 1

You can get all the child tags except the div with class="needremove":

response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract()

Demo from the shell:

$ scrapy shell index.html
In [1]: response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract()
Out[1]: [u'<p>text</p>', u'<p>text</p>', u'<p>text</p>', u'<p>text</p>']
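
Note that the predicate above drops every div child, and every child whose class contains "needremove", not only elements that are both. For the sample markup that makes no difference. If only the div with class "needremove" should be excluded, the per-child test can be written out explicitly. The following is a minimal sketch rather than the answer's method: it uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs on its own, it assumes the markup is well-formed XML (real pages often are not), and the function name children_without_needremove is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question (assumed to be well-formed XML).
HTML = """<div class="abc">
    <div class="xyz">
        <div class="needremove"></div>
        <p>text</p>
        <p>text</p>
        <p>text</p>
        <p>text</p>
    </div>
</div>"""


def children_without_needremove(markup):
    """Return the serialized children of every div.xyz, skipping only
    child elements that are a div AND carry the class "needremove".
    (Hypothetical helper, for illustration only.)"""
    root = ET.fromstring(markup)
    kept = []
    for parent in root.iter("div"):
        # Only look inside divs whose class list includes "xyz".
        if "xyz" not in parent.get("class", "").split():
            continue
        for child in list(parent):
            classes = child.get("class", "").split()
            if child.tag == "div" and "needremove" in classes:
                continue  # this is the element to exclude
            # tostring() also emits trailing whitespace (the tail); strip it.
            kept.append(ET.tostring(child, encoding="unicode").strip())
    return kept


if __name__ == "__main__":
    print(children_without_needremove(HTML))
```

In Scrapy itself, the same combined test can likely be expressed by replacing the last step of the XPath with *[not(self::div and contains(@class, "needremove"))], which keeps other div children instead of discarding them.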
