PHP XPath: print an entire HTML table

Asked: 2017-09-11 14:47:37

Tags: php html xpath

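In my code I am trying to grab the entire HTML of each page while ignoring all JavaScript (the AdSense code) on an old website. I have about 800 pages, which is too many to copy one by one. I am facing two main problems: first, my XPath is too long and gives me an error every time; second, it prints only the text rather than the HTML code. I don't know how to fix this.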

My XPath

/html/body/div/div/div/div[4]/table/tbody/tr/td/div/h2/table/tbody/tr/td/div[1]/table/tbody/tr/td[1]/div/table/tbody/tr/td/div/table/tbody/tr/td/div/table/tbody/tr/td/div

The error I get is available at https://pastebin.com/FFRLr3vq

My current PHP code

error_reporting(E_ERROR);
$urls[] = "http://myoldwebsite.com/somepage.html";

function curlload($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL,$url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 1);
        $source = curl_exec($ch);
        return $source;
}

foreach ($urls as $url) {
    $source = curlload($url);
    $doc = new DOMDocument();
    @$doc->loadHTML($source); // suppress warnings caused by malformed HTML

    $xpath = new DOMXPath($doc);
    $nodeList = $xpath->query("//div[@class='pageContent']");

    // To check the result (nodeValue yields only the text, not the markup):
    foreach ($nodeList as $node) {
        echo "<p>" . $node->nodeValue . "</p>";
    }
}

1 answer:

Answer 0 (score: 1)

To output the loaded HTML, you can use DOMDocument::saveHTML():

http://php.net/manual/de/domdocument.savehtml.php
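As a minimal sketch of how this addresses the "text only" problem: saveHTML() accepts an optional node argument, so the markup of each node matched by the XPath query can be serialized instead of reading nodeValue. The sample markup below is made up for illustration; only the pageContent class comes from the question.

```php
<?php
// Hypothetical sample markup standing in for one of the old pages.
$html = '<html><body><div class="pageContent"><p>Hello <b>world</b></p></div></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$parts = [];
foreach ($xpath->query("//div[@class='pageContent']") as $node) {
    // saveHTML($node) serializes the node together with its markup,
    // whereas $node->nodeValue would return only the text content.
    $parts[] = $doc->saveHTML($node);
}
$output = implode("\n", $parts);
echo $output;
```

A short relative query such as //div[@class='pageContent'] is also less fragile than a long absolute path, since it does not depend on every intermediate element being present in the parsed document.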

To remove the script tags (as discussed in chat), you can use something like the following:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

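// getElementsByTagName() returns a live list, so collect the nodes first
// and remove them in a second pass.
$script = $dom->getElementsByTagName('script');

$remove = [];
foreach ($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item);
}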

$html = $dom->saveHTML();
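
The two steps can be combined for the use case in the question. The following is only a sketch under assumptions: extractContentHtml() is a hypothetical helper name, the sample markup is invented, and the pageContent query is taken from the question's code. For real pages the string returned by the question's curlload() function would be passed in instead of $sample.

```php
<?php
/**
 * Hypothetical helper: returns the HTML of every node matched by $query,
 * after removing all <script> elements from the document.
 */
function extractContentHtml(string $source, string $query): string
{
    libxml_use_internal_errors(true); // keep warnings about old markup quiet
    $doc = new DOMDocument();
    $doc->loadHTML($source);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Copy the matches into an array first so that removing nodes
    // cannot disturb the iteration.
    foreach (iterator_to_array($xpath->query('//script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    $html = '';
    foreach ($xpath->query($query) as $node) {
        $html .= $doc->saveHTML($node) . "\n";
    }
    return $html;
}

// Invented sample page with an ad script inside the content div.
$sample = '<html><body><div class="pageContent"><p>Kept</p>'
        . '<script>adsbygoogle.push({});</script></div></body></html>';

$clean = extractContentHtml($sample, "//div[@class='pageContent']");
echo $clean;
```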

Source & more information: remove script tag from HTML content