Non-greedy XPath: get the HTML before the nearest h2 node

Time: 2018-12-15 00:38:31

Tags: xpath web-scraping web-crawler

Is it possible to make an XPath match non-greedily? For example, suppose I have this HTML:

<div>
    <p>A</p>
    <p>B</p>
    <h2>Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
</div>

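I want an XPath that selects only the paragraphs containing A and B. The text inside the nearest h2 node changes all the time, so I need a non-greedy XPath if that is possible. Is it possible, and if so, how?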

3 answers:

Answer 0: (score: 1)

Try this XPath:

//div/p[following::h2[contains(.,'Only until this node')]]

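It selects the `p` elements up to the point where it hits the `h2` element containing the text `Only until this node`.

See the following example, which prints A, B, C and D, one per line: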

from scrapy import Selector

htmldoc="""
<div>
    <p>A</p>
    <p>B</p>
    <p>C</p>
    <p>D</p>
    <h2>Only until this node</h2>
    <p>E</p>
    <p>F</p>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
    <p>I should not even this</p>
</div>
"""
sel = Selector(text=htmldoc)
for item in sel.xpath("//div/p[following::h2[contains(.,'Only until this node')]]/text()").extract():
    print(item)

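Answer 1: (score: 1)

I assume the text in <h2>Only until this node</h2> is dynamic, so you can select the first h2 by index and take the p siblings that precede it: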

//div/h2[1]/preceding-sibling::p

var htmlString = `
<body>
  <div>
    <p>A</p>
    <p>B</p>
    <h2>Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
  </div>
  <div>
    <p>A1</p>
    <p>B2</p>
    <p>C3</p>
    <h2>Second Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
  </div>
</body>`;

var doc = new DOMParser().parseFromString(htmlString, 'text/xml');
var iterator = doc.evaluate('//div/h2[1]/preceding-sibling::p', doc, null, XPathResult.UNORDERED_NODE_ITERATOR_TYPE, null);
var thisNode = iterator.iterateNext();
while (thisNode) {
  console.log(thisNode.outerHTML);
  thisNode = iterator.iterateNext();
}
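
For comparison, here is a minimal Python sketch of the same "stop at the first h2" idea using only the standard library. It does not evaluate the XPath itself, because the XPath subset in `xml.etree.ElementTree` does not cover the `preceding-sibling` axis. Instead it walks each div's children with `itertools.takewhile`. The function name `paragraphs_before_first_h2` is made up for illustration, and the markup is assumed to be well-formed XML, as in the snippet above.

```python
import xml.etree.ElementTree as ET
from itertools import takewhile

HTML_DOC = """
<body>
  <div>
    <p>A</p>
    <p>B</p>
    <h2>Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
  </div>
  <div>
    <p>A1</p>
    <p>B2</p>
    <p>C3</p>
    <h2>Second Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
  </div>
</body>
"""


def paragraphs_before_first_h2(root):
    """Return the text of each <p> that precedes the first <h2> child of its <div>."""
    texts = []
    for div in root.iter("div"):
        if div.find("h2") is None:
            # No h2 child: //div/h2[1]/preceding-sibling::p selects nothing here.
            continue
        # Walk the children in document order and stop at the first h2.
        leading = takewhile(lambda child: child.tag != "h2", div)
        texts.extend(child.text for child in leading if child.tag == "p")
    return texts


result = paragraphs_before_first_h2(ET.fromstring(HTML_DOC))
print(result)
```

Because the walk stops at the first h2 regardless of its text, the result does not depend on what the heading says, which is the property the question asks for.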

Answer 2: (score: 0)

You can try the following XPath 1.0 expression:

/div/p[following-sibling::*[self::h2='Only until this node']]

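It selects all p elements that have a following h2 sibling whose text() value is 'Only until this node'. Note that the leading /div matches only when div is the root element of the document; use //div otherwise.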
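
To make the predicate's behavior concrete, the sketch below mimics it in plain Python on the HTML from the question. It is a hand-written emulation under stated assumptions, not an XPath evaluation: the helper name `paragraphs_with_following_h2` is hypothetical, and the input is assumed to be well-formed XML with div as its root.

```python
import xml.etree.ElementTree as ET

HTML_DOC = """
<div>
    <p>A</p>
    <p>B</p>
    <h2>Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
</div>
"""


def paragraphs_with_following_h2(div, heading_text):
    """Mimic p[following-sibling::*[self::h2 = heading_text]] for one div."""
    children = list(div)
    selected = []
    for index, child in enumerate(children):
        if child.tag != "p":
            continue
        # A p qualifies if any later sibling is an h2 whose full text matches.
        later_siblings = children[index + 1:]
        if any(
            sibling.tag == "h2" and "".join(sibling.itertext()) == heading_text
            for sibling in later_siblings
        ):
            selected.append(child.text)
    return selected


div = ET.fromstring(HTML_DOC)
first = paragraphs_with_following_h2(div, "Only until this node")
second = paragraphs_with_following_h2(div, "Even though this node exists")
print(first)
print(second)
```

With the first heading the result is only A and B. With the second heading the paragraph between the two headings is included as well, so this approach depends on knowing the exact heading text.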