I have the following XPath query:
//div[@class="row"]//div[@class="post-container"]//div[contains(@class,"post-content")]//p
I am trying to get the article content from the following URL:
http://gawker.com/u-s-pulls-ahead-in-taylor-swift-education-continues-t-1445261687
It doesn't seem to work. I am expecting an array of DOMNodes containing all the p tags.
Here is my code:
error_reporting(E_ERROR);
$domDocument = new DOMDocument('1.0','UTF-8');
$urlText = file_get_contents($url);
$domDocument->loadHTML($urlText);
$finder = new DOMXPath($domDocument);
$xpath = '//div[@class="row"]//div[@class="post-container"]//div[contains(@class,"post-content")]//p';
$xpathContents = $finder->query($xpath);
Note: I need to use file_get_contents because of additional parsing logic.
Answer 0: (score: 0)
The cause is the first <script> tag, which contains some invalid <p> tags. It closes tags inside document.write code, and I guess that is what "messes with the head" of DOMDocument / DOMXPath.
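To check what the parser actually complains about, one option is to collect libxml's messages instead of hiding them with error_reporting. This is a minimal sketch on a made-up HTML string (the sample markup is an assumption, not the Gawker page), and whether a given libxml2 version reports anything for it may vary:

```php
<?php
// Hypothetical sample: a <script> whose document.write() string contains </p>.
$html = '<html><body>'
      . '<script>document.write("<p>ad</p>");</script>'
      . '<div class="post-content"><p>Article text.</p></div>'
      . '</body></html>';

// Collect parser messages internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Each entry is a LibXMLError with line, column, and message properties.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the error buffer so later parses start clean.
libxml_clear_errors();
```

Running the same loop on the real page should show where the parser starts to disagree with the markup.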
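It's not too elegant, but you can work around this by importing the already-loaded document into SimpleXML and running a modified XPath query based on data-textannotation-id (which avoids the pesky script tag).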
error_reporting(E_ERROR);
$domDocument = new DOMDocument('1.0','UTF-8');
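// 'g' is the source path as given in this answer; substitute the article URL or a local copy of the page
$urlText = file_get_contents('g');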
$domDocument->loadHTML($urlText);
$sxe = simplexml_import_dom($domDocument);
$xpath = '//*[@data-textannotation-id]';
$xpathContents = $sxe->xpath($xpath);
// What do you know, there are 8 items here... woop woop!
print_r($xpathContents);
// bonus points, here is the text output!
$text = '';
foreach ($xpathContents as $node) {
    $text .= trim((string)dom_import_simplexml($node)->textContent);
}
echo $text;
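If a plain array of paragraph strings is more convenient, the same attribute-based query can also be run through DOMXPath directly. The sketch below uses an invented HTML fragment and assumes every paragraph of interest carries a data-textannotation-id attribute; it has not been verified against the live page:

```php
<?php
// Hypothetical sample markup; only the first two paragraphs carry the attribute.
$html = '<html><body><div class="post-content">'
      . '<p data-textannotation-id="a1">First paragraph.</p>'
      . '<p data-textannotation-id="a2"> Second paragraph. </p>'
      . '<p>No annotation id here.</p>'
      . '</div></body></html>';

// Keep parser messages out of the output.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($doc);

// Collect the trimmed text of every element that has the attribute.
$paragraphs = [];
foreach ($finder->query('//*[@data-textannotation-id]') as $node) {
    $paragraphs[] = trim($node->textContent);
}

print_r($paragraphs);
```

DOMXPath::query returns a DOMNodeList rather than an array, so the loop builds the array explicitly.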