Is it possible to match non-greedily with XPath? For example, I have this HTML:
<div>
<p>A</p>
<p>B</p>
<h2>Only until this node</h2>
<p>I should not get this</p>
<h2>Even though this node exists</h2>
</div>
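I want an XPath that selects only the paragraphs A and B, that is, everything up to the nearest h2. The text inside that h2 node changes all the time, so I need the XPath to be non-greedy if possible. Is this possible, and if so, how?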
Answer 0 (score: 1)
Try this XPath:
//div/p[following::h2[contains(.,'Only until this node')]]
It selects the p elements that come before the h2 element containing the text 'Only until this node'.
See the following example, which prints A, B, C and D:
from scrapy import Selector
htmldoc="""
<div>
<p>A</p>
<p>B</p>
<p>C</p>
<p>D</p>
<h2>Only until this node</h2>
<p>E</p>
<p>F</p>
<p>I should not get this</p>
<h2>Even though this node exists</h2>
<p>I should not even this</p>
</div>
"""
sel = Selector(text=htmldoc)
for item in sel.xpath("//div/p[following::h2[contains(.,'Only until this node')]]/text()").extract():
    print(item)
Answer 1 (score: 1)
Assuming the text of <h2>Only until this node</h2> is dynamic, you can select the first h2 by index and take the p siblings that precede it:
//div/h2[1]/preceding-sibling::p
var htmlString = `
<body>
<div>
<p>A</p>
<p>B</p>
<h2>Only until this node</h2>
<p>I should not get this</p>
<h2>Even though this node exists</h2>
</div>
<div>
<p>A1</p>
<p>B2</p>
<p>C3</p>
<h2>Second Only until this node</h2>
<p>I should not get this</p>
<h2>Even though this node exists</h2>
</div>
</body>`;
var doc = new DOMParser().parseFromString(htmlString, 'text/xml');
var iterator = doc.evaluate('//div/h2[1]/preceding-sibling::p', doc, null, XPathResult.UNORDERED_NODE_ITERATOR_TYPE, null);
var thisNode = iterator.iterateNext();
while (thisNode) {
  console.log(thisNode.outerHTML);
  thisNode = iterator.iterateNext();
}
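The "stop at the first h2" idea can also be cross-checked without any XPath sibling axes. The following is only a minimal sketch using Python's standard library; it assumes the markup is well-formed enough to parse as XML, and the helper name paragraphs_before_first_h2 is made up for this illustration. One difference from the XPath above: for a div with no h2 at all, this sketch returns every p, whereas //div/h2[1]/preceding-sibling::p would return nothing.

```python
import xml.etree.ElementTree as ET
from itertools import takewhile

HTML_DOC = """
<body>
  <div>
    <p>A</p>
    <p>B</p>
    <h2>Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
  </div>
  <div>
    <p>A1</p>
    <p>B2</p>
    <p>C3</p>
    <h2>Second Only until this node</h2>
    <p>I should not get this</p>
  </div>
</body>
"""


def paragraphs_before_first_h2(container):
    """Return the text of the <p> children that precede the first <h2> child.

    Hypothetical helper for illustration; not part of any library.
    """
    # takewhile stops at the first <h2>, so later <p> elements are never reached.
    leading = takewhile(lambda child: child.tag != "h2", container)
    return [child.text for child in leading if child.tag == "p"]


root = ET.fromstring(HTML_DOC)
# One list of paragraph texts per <div>, in document order.
result = [paragraphs_before_first_h2(div) for div in root.iter("div")]
print(result)
```

Because the cut-off is the position of the first h2 rather than its text, the result does not change when the heading text changes.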
Answer 2 (score: 0)
You can try the following XPath 1.0 expression:
/div/p[following-sibling::*[self::h2='Only until this node']]
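It selects all p elements that have a following-sibling h2 whose string value is 'Only until this node'. Note that the leading /div only matches when div is the root element of the document; use //div otherwise.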
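Unlike contains(), the = comparison in this expression requires the whole heading text to match. The sketch below mimics that behaviour in plain Python, assuming well-formed markup; the helper name paragraphs_before_heading is hypothetical, and it stops at the first matching h2, which is a simplification of what the XPath predicate does when several headings share the same text.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<div>
<p>A</p>
<p>B</p>
<h2>Only until this node</h2>
<p>I should not get this</p>
<h2>Even though this node exists</h2>
</div>
"""


def paragraphs_before_heading(container, heading_text):
    """Return the text of <p> children located before the first <h2> child
    whose full text equals heading_text (hypothetical helper)."""
    children = list(container)
    for index, child in enumerate(children):
        # itertext() joins nested text, similar to an XPath string value.
        if child.tag == "h2" and "".join(child.itertext()) == heading_text:
            return [c.text for c in children[:index] if c.tag == "p"]
    return []  # no heading with exactly this text


div = ET.fromstring(FRAGMENT)
exact = paragraphs_before_heading(div, "Only until this node")
partial = paragraphs_before_heading(div, "Only until")
print(exact, partial)
```

Since the question says the heading text keeps changing, any text-based match like this needs the current heading text passed in each time; a position-based selection avoids that.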