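Below is a random, unpredictable set of tags wrapped in a div tag. How can I explode the innerHTML of all child tags into an array while preserving the order in which they appear?
Note: for img and iframe tags, only the URL needs to be extracted.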
<div>
<p>para-1</p>
<p>para-2</p>
<p>
text-before-image
<img src="text-image-src"/>
text-after-image</p>
<p>
<iframe src="p-iframe-url"></iframe>
</p>
<iframe src="iframe-url"></iframe>
<h1>header-1</h1>
<img src="image-url"/>
<p>
<img src="p-image-url"/>
</p>
content not wrapped within any tags
<h2>header-2</h2>
<p>para-3</p>
<ul>
<li>list-item-1</li>
<li>list-item-2</li>
</ul>
<span>span-content</span>
content not wrapped within any tags
</div>
Expected array:
["para-1","para-2","text-before-image","text-image-src","text-after-image",
"p-iframe-url","iframe-url","header-1","image-url",
"p-image-url","content not wrapped within any tags","header-2","para-3",
"list-item-1","list-item-2","span-content","content not wrapped within any tags"]
Relevant code:
$dom = new DOMDocument();
@$dom->loadHTML( $content );
$tags = $dom->getElementsByTagName( 'p' );
// Get all the paragraph tags, to iterate over their nodes.
$j = 0;
foreach ( $tags as $tag ) {
    // get_inner_html() to preserve the node's text & tags
    $con[ $j ] = $this->get_inner_html( $tag );
    // Check if the node has HTML content or not
    if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
        // Check if the node contains HTML along with plain text without any tags
        if ( $tag->nodeValue != '' ) {
            /*
             * DOM to get the image SRC of a node
             */
            $domM = new DOMDocument();
            /*
             * Setting encoding type http://in1.php.net/domdocument.loadhtml#74777
             * Set after initializing DOMDocument();
             */
            $con[ $j ] = mb_convert_encoding( $con[ $j ], 'HTML-ENTITIES', "UTF-8" );
            @$domM->loadHTML( $con[ $j ] );
            $y = new DOMXPath( $domM );
            foreach ( $y->query( "//img" ) as $node ) {
                $con[ $j ] = "img=" . $node->getAttribute( "src" );
                // Increment the array size to accommodate bad text and image tags.
                $j++;
                // Node incremented, fetch the node value and accommodate the text without any tags.
                $con[ $j ] = $tag->nodeValue;
            }
            $domC = new DOMDocument();
            @$domC->loadHTML( $con[ $j ] );
            $z = new DOMXPath( $domC );
            foreach ( $z->query( "//iframe" ) as $node ) {
                $con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
                // Increment the array size to accommodate bad text and image tags.
                $j++;
                // Node incremented, fetch the node value and accommodate the text without any tags.
                $con[ $j ] = $tag->nodeValue;
            }
        } else {
            /*
             * DOM to get the image SRC of a node
             */
            $domA = new DOMDocument();
            @$domA->loadHTML( $con[ $j ] );
            $x = new DOMXPath( $domA );
            foreach ( $x->query( "//img" ) as $node ) {
                $con[ $j ] = "img=" . $node->getAttribute( "src" );
            }
            if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
                foreach ( $x->query( "//iframe" ) as $node ) {
                    $con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
                }
            }
        }
    }
    // Increment the node index
    $j++;
}
$this->content = $con;
Answer 0 (score: 1)
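Try a recursive approach! Keep an empty array $parts on your class instance and write a function extractSomething(DOMNode $source). The function should handle each separate case, depending on what kind of node the source is, and then return.
Now, when the call to extractSomething(yourRootDiv) returns, you will have the list in $this->parts.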
Note that you have not defined what should happen with <p> sometext1 <img href="ref" /> sometext2 </p>,
but the approach above leans toward adding 3 elements ("sometext1", "ref" and "sometext2") for it.
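This is only a rough outline of a solution. The key point is that you need to process every node in the tree (probably regardless of its position) and, while visiting the nodes in the right order, build the array by converting each node into the text you want. Recursion is the quickest to code, but you could also try a breadth-first traversal or a walker tool.
The bottom line is that you have two tasks to accomplish: walk the nodes in the correct order, and convert each node into the desired result.
This is basically a rule of thumb for processing tree/graph structures.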
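The recursive outline in this answer could look roughly like the sketch below. It is only an illustration under some assumptions: the function name extractParts is hypothetical, the array is passed by reference instead of living on a class instance, and img/iframe are reduced to their src attribute as the question asks (the answer's own example mentions href).

```php
<?php
// Hypothetical helper (name and signature are assumptions for illustration).
// Walks the tree depth-first, which matches document order.
function extractParts(DOMNode $source, array &$parts): void
{
    if ($source instanceof DOMText) {
        // Text node: keep it unless it is only whitespace.
        $text = trim($source->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
        return;
    }
    if ($source instanceof DOMElement) {
        $name = strtolower($source->nodeName);
        if ($name === 'img' || $name === 'iframe') {
            // Special case: only the URL is wanted.
            $parts[] = $source->getAttribute('src');
            return;
        }
        // Any other element: recurse into its children in order.
        foreach ($source->childNodes as $child) {
            extractParts($child, $parts);
        }
    }
}

// Shortened sample input, modeled on the markup in the question.
$html = '<div><p>para-1</p>'
      . '<p>text-before-image <img src="text-image-src"/> text-after-image</p>'
      . '<iframe src="iframe-url"></iframe>loose text'
      . '<ul><li>list-item-1</li></ul></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$div = $doc->getElementsByTagName('div')->item(0);

$parts = [];
extractParts($div, $parts);
print_r($parts);
```

Run against the full div from the question, the same function should give the expected array, since whitespace-only text nodes are skipped and every other text node is trimmed.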
Answer 1 (score: 1)
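A quick and easy way to pull interesting information out of a DOM document is to use XPath. Below is a basic example showing how to get the text content and the attribute text from within the div element.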
<?php
// Pre-amble, scroll down to interesting stuff...
$html = '<div>
<p>para-1</p>
<p>para-2</p>
<p>
<iframe src="p-iframe-url"></iframe>
</p>
<iframe src="iframe-url"></iframe>
<h1>header-1</h1>
<img src="image-url"/>
<p>
<img src="p-image-url"/>
</p>
content not wrapped within any tags
<h2>header-2</h2>
<p>para-3</p>
<ul>
<li>list-item-1</li>
<li>list-item-2</li>
</ul>
<span>span-content</span>
content not wrapped within any tags
</div>';
$doc = new DOMDocument;
$doc->loadHTML($html);
$div = $doc->getElementsByTagName('div')->item(0);
// Interesting stuff:
// Use XPath to get all text nodes and attribute text
// $texts becomes a DOMNodeList filled with DOMText and DOMAttr objects
$xpath = new DOMXPath($doc);
$texts = $xpath->query('descendant-or-self::*/text()|descendant::*/@*', $div);
// You could only include/exclude specific attributes by looking at their name
// e.g. multiple paths: .//@src|.//@href
// or whitelist: descendant::*/@*[name()="src" or name()="href"]
// or blacklist: descendant::*/@*[not(name()="ignore")]
// Build an array of the text held by the DOMText and DOMAttr objects
// skipping any boring whitespace
$results = array();
foreach ($texts as $text) {
    $trimmed_text = trim($text->nodeValue);
    if ($trimmed_text !== '') {
        $results[] = $trimmed_text;
    }
}
// Let's see what we have
var_dump($results);
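The comments in the example mention restricting the query to specific attributes. A minimal sketch of that variant follows, assuming only the src attributes of img and iframe are wanted; the extra class, alt and width attributes in the sample markup are made up here to show that they get filtered out.

```php
<?php
// Sample input; the non-src attributes are illustrative additions.
$html = '<div><p class="intro">para-1</p>'
      . '<img src="image-url" alt="ignored"/>'
      . '<p><iframe src="p-iframe-url" width="300"></iframe></p>'
      . 'loose text</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$div = $doc->getElementsByTagName('div')->item(0);

$xpath = new DOMXPath($doc);
// All text nodes under the div, plus only the src attributes of img/iframe.
$nodes = $xpath->query('.//text() | .//img/@src | .//iframe/@src', $div);

$results = [];
foreach ($nodes as $node) {
    // Skip whitespace-only text nodes, as in the example above.
    $value = trim($node->nodeValue);
    if ($value !== '') {
        $results[] = $value;
    }
}
print_r($results);
```

As with the original query, this relies on the union being returned in document order, which is what keeps the text and the URLs interleaved correctly.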
Answer 2 (score: -1)
The simplest way is to use DOMDocument: http://www.php.net/manual/en/domdocument.loadhtmlfile.php