XPath query is not grabbing the content

Asked: 2013-11-27 18:50:35

Tags: php dom xpath

I have the following XPath query:

//div[@class="row"]//div[@class="post-container"]//div[contains(@class,"post-content")]//p

I am trying to get the article content from the following URL:

http://gawker.com/u-s-pulls-ahead-in-taylor-swift-education-continues-t-1445261687

It doesn't seem to work. What I'm expecting is an array of DOMNodes containing all the p tags.

Here is my code:

error_reporting(E_ERROR);
// The article URL given above
$url = 'http://gawker.com/u-s-pulls-ahead-in-taylor-swift-education-continues-t-1445261687';
$domDocument = new DOMDocument('1.0','UTF-8');
// Fetch the raw HTML and parse it
$urlText = file_get_contents($url);
$domDocument->loadHTML($urlText);
$finder = new DOMXPath($domDocument);
$xpath = '//div[@class="row"]//div[@class="post-container"]//div[contains(@class,"post-content")]//p';
$xpathContents = $finder->query($xpath);

Note: I need to use file_get_contents because of additional parsing logic.

1 answer:

Answer 0 (score: 0)

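The cause is that the first <script> tag contains some invalid <p> markup: its document.write code closes some tags, and my guess is that this is what "messes with the head" of DOMDocument / DOMXPath.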
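To check a diagnosis like this, it can help to collect libxml's parse messages instead of silencing them with error_reporting. The following is a minimal sketch under assumptions: the helper name collectParseErrors and the inline sample markup are hypothetical stand-ins, not the actual page, and which messages appear depends on the installed libxml version.

```php
<?php
// Hypothetical helper: parse HTML and return libxml's messages as strings.
function collectParseErrors(string $html): array
{
    // Buffer libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError carrying a line number and a message.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Reset the buffer and restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Stand-in markup with a closing tag inside document.write (an assumption).
$sample = '<html><body><script>document.write("</p>");</script>'
        . '<div class="post-content"><p>Hello</p></div></body></html>';

foreach (collectParseErrors($sample) as $message) {
    echo $message, PHP_EOL;
}
```

Any message that points at a line inside the script block would support the explanation above; an empty list would suggest looking elsewhere.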

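It's not too elegant, but you can work around the problem by importing the already-loaded document into SimpleXML and running a modified XPath query based on data-textannotation-id (which avoids the nasty script tag).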

error_reporting(E_ERROR);
$domDocument = new DOMDocument('1.0','UTF-8');
$urlText = file_get_contents('g');
$domDocument->loadHTML($urlText);
$sxe = simplexml_import_dom($domDocument);
$xpath = '//*[@data-textannotation-id]';
$xpathContents = $sxe->xpath($xpath);

// What do you know, there are 8 items here... woop woop!
print_r($xpathContents);

// bonus points, here is the text output!
$text = '';
foreach ($xpathContents as $node) {
    $text .= trim((string)dom_import_simplexml($node)->textContent);
}
echo $text;
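
If the stray tags inside the script really are the cause, another workaround worth trying is to drop the script blocks from the raw string before parsing and then run the original query unchanged. This is a different technique from the SimpleXML one above and only a sketch: the helper name extractParagraphs, the regular expression, and the sample markup are all assumptions, and a regex-based strip can miss unusual script markup.

```php
<?php
// Hypothetical helper: remove <script> blocks, then collect matching text.
function extractParagraphs(string $html, string $xpath): array
{
    // Drop every <script>...</script> block (non-greedy, case-insensitive).
    // Fall back to the untouched input if preg_replace fails and returns null.
    $clean = preg_replace('~<script\b[^>]*>.*?</script>~is', '', $html) ?? $html;

    // Keep libxml quiet while parsing, then restore the previous mode.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($clean);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $finder = new DOMXPath($doc);
    $paragraphs = [];
    foreach ($finder->query($xpath) as $node) {
        $paragraphs[] = trim($node->textContent);
    }

    return $paragraphs;
}

// Stand-in markup (an assumption) with a script that writes broken tags.
$sample = '<html><body><div class="row"><div class="post-container">'
        . '<script>document.write("<p>ad</p></div>");</script>'
        . '<div class="post-content entry"><p>First.</p><p>Second.</p></div>'
        . '</div></div></body></html>';

// The query from the question, unchanged.
$xpath = '//div[@class="row"]//div[@class="post-container"]'
       . '//div[contains(@class,"post-content")]//p';

print_r(extractParagraphs($sample, $xpath));
```

Because the script text never reaches the parser, the result no longer depends on how libxml treats closing tags inside script content.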