How can I get the exact structure of an HTML document using DomDocument or XPath in PHP?

Time: 2015-07-19 15:07:26

Tags: php html xpath domdocument

I have an HTML document, for example:

<!DOCTYPE html>
<html>
<head>
    <title>Webpage</title>
</head>
<body>
<div class="content">
    <div>
        <p>Paragraph</p>
    </div>
    <div>
        <a href="someurl">This is an anchor</a>
    </div>
    <p>This is a paragraph inside a div</p>
</div>
</body>
</html>

I want to get the exact structure (the markup, tags included) of the div with the class content.

If I fetch the div with the getElementsByTagName() method of DomDocument in PHP, I get this:

    DOMElement Object
  (
    [tagName] => div
    [schemaTypeInfo] => 
    [nodeName] => div
    [nodeValue] => 

        Paragraph


        This is an anchor

    This is a paragraph inside a div

    [nodeType] => 1
    [parentNode] => (object value omitted)
    [childNodes] => (object value omitted)
    [firstChild] => (object value omitted)
    [lastChild] => (object value omitted)
    [previousSibling] => (object value omitted)
    [nextSibling] => (object value omitted)
    [attributes] => (object value omitted)
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => div
    [baseURI] => 
    [textContent] => 

        Paragraph


        This is an anchor

    This is a paragraph inside a div

)

How can I get this instead:

<div class="content">
    <div>
        <p>Paragraph</p>
    </div>
    <div>
        <a href="someurl">This is an anchor</a>
    </div>
    <p>This is a paragraph inside a div</p>
</div>

Is there a way to do this?

1 answer:

Answer 0: (score: 0)

Assuming $str contains the HTML:

// Create the DOMDocument and load the HTML
$doc = new DOMDocument();
$doc->loadHTML($str);
// Find the div we need
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@class = "content"]');
// Stop unless exactly one div matches (covers both zero and several matches)
if ($elements->length != 1) die("expected exactly one div with class 'content'");
// Take the first (and only) match
$div = $elements->item(0);
// Output the markup of $div, including its own tags
echo $doc->saveHTML($div);

Result:

<div class="content">
    <div>
        <p>Paragraph</p>
    </div>
    <div>
        <a href="someurl">This is an anchor</a>
    </div>
    <p>This is a paragraph inside a div</p>
</div>
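
The query in the answer only matches when the class attribute is exactly "content". A minimal sketch of a variant that also matches elements carrying several classes (for example class="content main") and returns every match instead of stopping follows. The helper name outerHtmlByClass and the sample markup are hypothetical and not part of the original answer, and the class name is assumed to be a plain token without quotes or spaces.

```php
<?php
// Hypothetical helper (not from the original answer): returns the outer HTML
// of every element whose class attribute contains $class as a separate token.
// Assumes $class is a simple class name with no quotes or whitespace.
function outerHtmlByClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Pad the normalized class list with spaces so that " content " matches
    // only a whole class token, not a substring such as "contents".
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $result = [];
    foreach ($xpath->query($query) as $node) {
        // Passing a node to saveHTML() serializes that node with its tags.
        $result[] = $doc->saveHTML($node);
    }
    return $result;
}

// Sample input, assumed for illustration only.
$str = '<!DOCTYPE html><html><body><div class="content main"><p>Paragraph</p></div></body></html>';
$matches = outerHtmlByClass($str, 'content');
echo $matches[0], PHP_EOL;
```

Returning an array leaves it to the caller to decide what zero or several matches should mean, rather than calling die() inside the helper.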