Question

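I want to extract a couple of tables from a web page and show them in my page.

I was going to use regular expressions to extract them, but then I saw the DOMDocument class, which looks cleaner. I have looked on Stack Overflow, and all the questions seem to be about getting the inner text or looping to get the inner nodes of an element. What I want to know is how to extract and print an HTML element by its id.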

$html = file_get_contents("www.site.com");
$xml = new DOMDocument();
$xml->loadHTML($html);
$xpath = new DOMXPath($xml);
$table =$xpath->query("//*[@id='myid']");
$table->saveHTML(); // this obviously doesn't work

How can I display or echo $table as an actual HTML table on my page?

Answer 1

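Firstly, DOMDocument has a getElementById() method, so your XPath is unnecessary, although I suspect that is how it works underneath.

Secondly, to get a fragment of markup rather than the whole document, you use DOMNode::C14N(), so your code would look like this: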

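<?php

    // Load the HTML into a DOMDocument
    // Don't forget you could just pass the URL to loadHTMLFile()
    // The scheme is required: without it PHP looks for a local file named "www.site.com"
    $html = file_get_contents("http://www.site.com");
    $dom = new DOMDocument('1.0');
    $dom->loadHTML($html);

    // Get the target element (null if no element has that id)
    $element = $dom->getElementById('myid');

    // Get the markup of just that element as a string and print it
    $string = $element->C14N();
    echo $string;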

See a working example.
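Because C14N() emits canonical XML, empty elements come out as start/end pairs such as `<br></br>`. As an alternative, DOMDocument::saveHTML() accepts an optional node argument as of PHP 5.3.6 and serializes just that node as HTML. Here is a minimal sketch, assuming an inline HTML string in place of the remote page; the markup and the `myid` value are placeholders:

```php
<?php
// Placeholder markup standing in for the downloaded page (an assumption for this sketch)
$html = '<html><body><p>intro</p>'
      . '<table id="myid"><tr><td>cell<br>two</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$element = $dom->getElementById('myid');
if ($element === null) {
    exit("No element with id 'myid' was found\n");
}

// Serialize only this node, as HTML rather than canonical XML
$tableHtml = $dom->saveHTML($element);
echo $tableHtml;

// The XPath query from the question returns a DOMNodeList,
// so an item has to be taken out of the list before it can be serialized
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//*[@id='myid']");
$viaXpath = $nodes->length > 0 ? $dom->saveHTML($nodes->item(0)) : '';
```

This also shows why `$table->saveHTML()` in the question fails: `$table` is a node list, and saveHTML() is a method of the document, not of the list.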

Answer 2

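You can use DOMElement::C14N() to get the canonicalized HTML (XML) representation of a DOMElement, or, if you want more control so that you can filter out certain elements and attributes, you can use something like this: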

// Serialize a DOMNodeList to HTML, dropping unwanted tags and attributes.
// $baseUrl replaces the original $this->url, which is undefined outside a class.
// resolve_href() is the answerer's own helper and is not shown here, so it is
// only called when it exists and a base URL was passed in.
function toHTML($nodeList, $tagsToStrip = array('script','object','noscript','form','style'), $attributesToSkip = array('on*'), $baseUrl = null) {
    $html = '';
    foreach ($nodeList as $node) {
        $name = strtolower($node->nodeName);
        if (in_array($name, $tagsToStrip)) {
            continue;
        }
        if ($name === '#text') {
            // Text node: escape it so it cannot inject markup
            $html .= htmlspecialchars($node->textContent, ENT_COMPAT, 'UTF-8', true);
            continue;
        }
        if ($name[0] === '#') {
            continue; // comments, CDATA sections and other non-element nodes
        }
        $html .= ' <' . $node->nodeName;
        if ($node->attributes) {
            foreach ($node->attributes as $attr) {
                $attrName = strtolower($attr->nodeName);
                // 'on*' is a wildcard for every event-handler attribute (onclick, onload, ...)
                $isHandler = in_array('on*', $attributesToSkip) && substr($attrName, 0, 2) === 'on';
                if (in_array($attrName, $attributesToSkip) || $isHandler) {
                    continue;
                }
                $value = $attr->nodeValue;
                if ($baseUrl !== null && in_array($attrName, array('src', 'href')) && function_exists('resolve_href')) {
                    $value = resolve_href($baseUrl, $value);
                }
                // Escape the value so a quote inside it cannot break out of the attribute
                $html .= ' ' . $attr->nodeName . '="' . htmlspecialchars($value, ENT_COMPAT, 'UTF-8') . '"';
            }
        }
        if (in_array($name, array('br', 'img'))) {
            $html .= ' />';
        } else {
            // Recurse into the children; an empty element simply yields an empty string
            $html .= '> ' . toHTML($node->childNodes, $tagsToStrip, $attributesToSkip, $baseUrl) . ' </' . $node->nodeName . '> ';
        }
    }
    return $html;
}

echo toHTML($table);
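
If the goal is only to drop whole elements such as scripts, a shorter route may be to remove them from the DOM and let saveHTML() serialize what is left. This is a different technique from the function above, sketched under assumptions: the markup is a placeholder, and `removeTags()` is a hypothetical helper name, not part of the DOM extension. It does not touch attributes, so event handlers such as `onclick` would still need separate handling.

```php
<?php
// Hypothetical helper: delete every descendant of $root whose tag name is listed in $tags
function removeTags(DOMNode $root, array $tags) {
    $xpath = new DOMXPath($root->ownerDocument);
    foreach ($tags as $tag) {
        // Copy the matches into a plain array before removing anything
        $matches = iterator_to_array($xpath->query('.//' . $tag, $root));
        foreach ($matches as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }
}

// Placeholder markup (an assumption for this sketch)
$html = '<table id="myid"><tr><td>keep<script>alert(1)</script></td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$table = $dom->getElementById('myid');
removeTags($table, array('script', 'style', 'noscript', 'object', 'form'));

// Serialize the cleaned-up element as HTML
$clean = $dom->saveHTML($table);
echo $clean;
```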

Extracting and printing an HTML element by its id using DOMDocument