I have this HTML:
<div class="abc">
<div class="xyz">
<div class="needremove"></div>
<p>text</p>
<p>text</p>
<p>text</p>
<p>text</p>
</div>
</div>
I used: response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]').extract()
Result:
[u'<div class="xyz">
<div class="needremove"></div>
<p>text</p>
<p>text</p>
<p>text</p>
<p>text</p>
</div>']
I want to remove <div class="needremove"></div>. Can you help me?
Answer 0 (score: 1)
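You can select all the child elements except the div with class="needremove". The predicate below skips any child that is a div or whose class contains "needremove":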
response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract()
Demo from the shell:
$ scrapy shell index.html
In [1]: response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract()
Out[1]: [u'<p>text</p>', u'<p>text</p>', u'<p>text</p>', u'<p>text</p>']
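If you would rather keep the outer <div class="xyz"> wrapper and only drop the unwanted child, one option is to remove the node from the parsed tree and serialize the parent again. The following is a minimal sketch of that idea using only Python's standard-library xml.etree.ElementTree rather than Scrapy's selectors. It assumes the markup is well-formed enough to parse as XML, as the sample in the question is. The names HTML and strip_needremove are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the question (well-formed, so it parses as XML).
HTML = """<div class="abc">
<div class="xyz">
<div class="needremove"></div>
<p>text</p>
<p>text</p>
<p>text</p>
<p>text</p>
</div>
</div>"""


def strip_needremove(markup):
    """Return the serialized div.xyz with its div.needremove children removed.

    Hypothetical helper for illustration only; it is not part of Scrapy.
    """
    root = ET.fromstring(markup)  # root is <div class="abc">
    xyz = root.find("./div[@class='xyz']")
    # Iterate over a copy because we mutate the child list while looping.
    for child in list(xyz):
        classes = child.get("class", "").split()
        if child.tag == "div" and "needremove" in classes:
            xyz.remove(child)
    return ET.tostring(xyz, encoding="unicode")


cleaned = strip_needremove(HTML)
print(cleaned)
```

Unlike the XPath filter above, this only removes children that are both a div and carry the needremove class, so other div children would be kept.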