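XPath expression for the first sentence in a paragraph

Asked: 2019-06-13 19:33:40

Tags: php xml xpath xml-parsing domxpath

I'm looking for an XPath expression that returns the first sentence of a paragraph.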

<p>
A federal agency is recommending that White House adviser Kellyanne Conway be 
removed from federal service saying she violated the Hatch Act on numerous 
occasions. The office is unrelated to Robert Mueller and his investigation.
</p>

The result should be:

A federal agency is recommending that White House adviser Kellyanne Conway be 
removed from federal service saying she violated the Hatch Act on numerous 
occasions.

I have tried a few things, but none of them worked.

$expression = '/html/body/div/div/div/div/p//text()';

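Do I need to use //p[ends-with or substring-before?

2 Answers:

Answer 0 (score: 2)

You won't be able to parse natural language with XPath, but you can take the substring up to and including the first period: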

substring(/p,1,string-length(substring-before(/p,"."))+1)

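Note that this may not be the "first sentence" if an abbreviation or some other period appears before the first sentence ends, or if the first sentence ends with different punctuation, and so on.

Or, more concisely: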

concat(substring-before(/p, "."), ".")

Credit: ThW's clever idea in the comments.
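For reference, here is a minimal sketch of running the concise expression from PHP with DOMXPath. It assumes ext/dom is available. Wrapping the paragraph in normalize-space() and selecting it with (//p)[1] are additions that are not part of the answer above; they collapse the line breaks inside the paragraph so the result comes back as a single line. The "Dr." string is a made-up input that illustrates the abbreviation caveat.

```php
<?php
$html = <<<'HTML'
<p>
A federal agency is recommending that White House adviser Kellyanne Conway be
removed from federal service saying she violated the Hatch Act on numerous
occasions. The office is unrelated to Robert Mueller and his investigation.
</p>
HTML;

$document = new DOMDocument();
$document->loadXML($html);
$xpath = new DOMXPath($document);

// normalize-space() trims the text and collapses whitespace runs (including
// line breaks) into single spaces; this wrapper is an addition to the answer.
$expression = 'concat(substring-before(normalize-space((//p)[1]), "."), ".")';
$firstSentence = $xpath->evaluate($expression);
echo $firstSentence, PHP_EOL;

// The caveat in action: an abbreviation ends the "sentence" too early.
$abbreviated = $xpath->evaluate(
    'concat(substring-before("Dr. Smith arrived late. He apologized.", "."), ".")'
);
echo $abbreviated, PHP_EOL; // prints "Dr."
```

If the paragraph contains no period at all, substring-before() returns an empty string and the expression yields just ".", so callers may want to check for that case.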

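Answer 1 (score: 1)

There isn't really a good way to do this at the XPath level. PHP only has XPath 1.0, which supports only basic string manipulation and has nothing that takes a locale or language into account. However, PHP itself has something useful for this in ext/intl.

So use DOM + XPath to fetch the text content of the paragraph element node as a string, and extract the first sentence from that.

IntlBreakIterator can split a string according to locale/language-specific rules.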

$html = <<<'HTML'
<p>
A federal agency is recommending that White House adviser Kellyanne Conway be 
removed from federal service saying she violated the Hatch Act on numerous 
occasions. The office is unrelated to Robert Mueller and his investigation.
</p>
HTML;

$document = new DOMDocument();
$document->loadXML($html);
$xpath = new DOMXpath($document);

// fetch the first paragraph in the document as string
$summary = $xpath->evaluate('string((//p)[1])');
// create a break iterator for en_US sentences.
$breaker = IntlBreakIterator::createSentenceInstance('en_US');
// remove line breaks before feeding it to the breaker (each wrapped line already ends with a space)
$breaker->setText(str_replace(["\r\n", "\n"], '', $summary));

$firstSentence = '';
// iterate the sentences
foreach ($breaker->getPartsIterator() as $sentence) {
  $firstSentence = $sentence;
  // break after the first sentence
  break;
}

var_dump($firstSentence);

Output:

string(164) "A federal agency is recommending that White House adviser Kellyanne Conway be removed from federal service saying she violated the Hatch Act on numerous occasions. "

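Additionally, DOMXPath lets you register PHP functions and call them from XPath expressions. That makes this possible if you need the logic at the XPath level (for example, to use it in conditions).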
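As a rough sketch of that last idea, the snippet below registers a PHP function and calls it inside an XPath predicate. It assumes ext/dom and ext/intl are installed. The helper first_sentence() and the sample markup are hypothetical and only serve the illustration; registerNamespace(), registerPhpFunctions() and php:functionString() are the DOMXPath features being demonstrated.

```php
<?php
// Hypothetical helper (not from the answer above): returns the first sentence
// of $text according to the en_US sentence rules of IntlBreakIterator.
function first_sentence(string $text): string
{
    // Collapse whitespace runs, including line breaks, into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    $breaker = IntlBreakIterator::createSentenceInstance('en_US');
    $breaker->setText($text);
    foreach ($breaker->getPartsIterator() as $sentence) {
        // Return the first part, without the trailing space the breaker keeps.
        return trim($sentence);
    }
    return '';
}

$document = new DOMDocument();
$document->loadXML('<div><p>No period here</p><p>First one. Second one.</p></div>');
$xpath = new DOMXPath($document);

// Bind the php: prefix and allow only this one function to be called.
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('first_sentence');

// Use the function in a condition: paragraphs whose first sentence matches.
$nodes = $xpath->query('//p[php:functionString("first_sentence", .) = "First one."]');
$matches = $nodes->length;

// Or evaluate it directly to get a string result.
$first = $xpath->evaluate('php:functionString("first_sentence", (//p)[2])');
echo $matches, ' match, first sentence: ', $first, PHP_EOL;
```

php:functionString() passes the string value of the node to the callback, whereas php:function() would pass an array of DOM nodes. Passing an explicit function name to registerPhpFunctions() limits which PHP functions the expression may call.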