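I am converting Word documents to HTML on the fly and need to split the resulting HTML into sections based on a delimiter. For example, the converted HTML should be parsed into the following sections.
Section 1:
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
Section 2:
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
<p>
<span>More content in section 2</span>
<p></p>
<div>
Section 3:
<div id="div2">
<p>
<b>
</b>
<p>
<p><font>[[delimiter]]Start of section 3</font></p>
<div>
<div id="div3">
<span><font>More content in section 3</font></span>
</div>
I can't simply "explode"/slice on the delimiter, because that would break the HTML. Every piece of text content has many parent elements.
I have no control over the HTML structure, and it sometimes changes depending on the structure of the Word document. End users will import Word documents to be parsed in the application, so the generated HTML will not be altered before it is parsed.
The content is often at different depths in the HTML.
I can't rely on element classes or IDs, because they are not consistent from document to document. #div1, #div2 and #div3 are only there for my example.
My goal is to parse out the content, so it is fine if empty elements are left over; I can simply run over the markup again and remove the empty tags (p, font, b, etc.).
My attempts:
I am using the PHP DOM extension to parse the HTML and loop through the nodes, but I can't come up with a good algorithm to figure this out.
$doc = new \DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    if ($child->hasChildNodes()) {
        // Do recursive call...
    } else {
        // Contains slide identifier?
    }
}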
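The second pass mentioned above (running over the markup again to drop leftover empty tags) could look roughly like the sketch below. It is only an illustration: the function name removeEmptyElements and the default tag list are assumptions, not part of the question, and it treats an element as empty when it has no child elements and only whitespace text.

```php
<?php
// Hypothetical helper (not from the question): removes elements that hold
// no child elements and no visible text, repeating until nothing is left to remove.
function removeEmptyElements(DOMDocument $doc, array $tags = ['p', 'font', 'b', 'span', 'div']): void
{
    $xp = new DOMXPath($doc);

    // Build a name test such as "self::p or self::font or ..."
    $tests = [];
    foreach ($tags as $tag) {
        $tests[] = "self::$tag";
    }
    $nameTest = implode(' or ', $tests);

    do {
        $empty = $xp->query("//*[($nameTest) and not(*) and not(normalize-space())]");
        $removed = 0;
        // Copy the list first so removing nodes does not disturb the iteration
        foreach (iterator_to_array($empty) as $node) {
            $node->parentNode->removeChild($node);
            $removed++;
        }
    } while ($removed > 0); // a parent may only become empty once its children are gone
}

$doc = new DOMDocument();
// Assumption: a well-formed fragment shaped like "Section 3" above
$doc->loadHTML('<div id="div2"><p><b></b></p><p><font>[[delimiter]]Start of section 3</font></p></div>');
removeEmptyElements($doc);
echo $doc->saveHTML($doc->getElementsByTagName('div')->item(0)), PHP_EOL;
```

The loop runs more than once because removing the empty b leaves its p empty in turn; a single pass would leave that p behind.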
Answer 0 (score: 7)
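To solve a problem like this, you first need to work out the steps the solution requires before you start coding. As the code below implements them, the steps are: find each element whose text starts with [[delimiter]], then walk up through its parents until you reach one that has a next sibling; that sibling holds the rest of the section's content.
Once you have that working, you are 90% done. All that is left is to clean up the unnecessary tags.
To end up with something you can extend, don't build a heap of obfuscated code that merely works; split all the data you need into something you can work with.
The code below uses two classes that do exactly what you need and give you a clean way to walk through all the elements once you need them. It uses PHP Simple HTML DOM Parser instead of DOMDocument, because I like it a little better.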
<?php
error_reporting(E_ALL);
require_once("simple_html_dom.php");
$html = <<<XML
<body>
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
</p>
<span>More content in section 2</span>
<p>
<font>[[delimiter]]Start of section 3</font>
</p>
</div>
<div id="div3">
<span>
<font>More content in section 3</font>
</span>
</div>
</body>
XML;
/*
* CALL
*/
$parser = new HtmlParser($html, '[[delimiter]]');
//dump found
//decode/encode to only show public values
print_r(json_decode(json_encode($parser)));
/*
* ACTUAL CODE
*/
class HtmlParser
{
    private $_html;
    private $_delimiter;
    private $_dom;
    public $Elements = array();
    final public function __construct($html, $delimiter)
    {
        $this->_html = $html;
        $this->_delimiter = $delimiter;
        $this->_dom = str_get_html($this->_html);
        $this->getElements();
    }
    //"final" dropped here: PHP 8 warns when a private method is declared final
    private function getElements()
    {
        //this will find all elements, including parent elements
        //it will also select the actual text as an element, without surrounding tags
        $elements = $this->_dom->find("[contains(text(),'".$this->_delimiter."')]");
        //find the actual elements that start with the delimiter
        foreach($elements as $element) {
            //we want the element without tags, so we search for outertext
            if (strpos($element->outertext, $this->_delimiter)===0) {
                $this->Elements[] = new DelimiterTag($element);
            }
        }
    }
}
class DelimiterTag
{
    private $_element;
    public $Content;
    public $MoreContent;
    final public function __construct($element)
    {
        $this->_element = $element;
        $this->Content = $element->outertext;
        $this->findMore();
    }
    //"final" dropped here: PHP 8 warns when a private method is declared final
    private function findMore()
    {
        //we need to traverse up until we find a parent that has a next sibling
        //we need to keep track of the child, to clean up the last parent
        $child = $this->_element;
        $parent = $child->parent();
        $next = null;
        while($parent) {
            $next = $parent->next_sibling();
            if ($next) {
                break;
            }
            $child = $parent;
            $parent = $child->parent();
        }
        if (!$next) {
            //no more content
            return;
        }
        //create empty element, to build the new data
        //go up one more element and clean the innertext
        $more = $parent->parent();
        $more->innertext = "";
        //add the parent, because this is where the actual content lies
        //but we only want to add the child to the parent, in case there are more delimiters
        $parent->innertext = $child->outertext;
        $more->innertext .= $parent->outertext;
        //add the next sibling, because this is where more content lies
        $more->innertext .= $next->outertext;
        //set the variables
        if ($more->tag=="body") {
            //Your section 3 works slightly differently, as it doesn't show the parent tag where the first two do.
            //That's why I show the innertext for the root tag and the outertext for the others.
            $this->MoreContent = $more->innertext;
        } else {
            $this->MoreContent = $more->outertext;
        }
    }
}
?>
Cleaned-up output:
stdClass Object
(
[Elements] => Array
(
[0] => stdClass Object
(
[Content] => [[delimiter]]Start of content section 1.
[MoreContent] => <div id="div1">
<p><font><b>[[delimiter]]Start of content section 1.</b></font></p>
<p><span>More content in section 1</span></p>
</div>
)
[1] => stdClass Object
(
[Content] => [[delimiter]]Start of section 2
[MoreContent] => <div id="div2">
<p><b><font>[[delimiter]]Start of section 2</font></b></p>
<span>More content in section 2</span>
</div>
)
[2] => stdClass Object
(
[Content] => [[delimiter]]Start of section 3
[MoreContent] => <div id="div2">
<p><font>[[delimiter]]Start of section 3</font></p>
</div>
<div id="div3">
<span><font>More content in section 3</font></span>
</div>
)
)
)
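For readers who would rather stay with the DOM extension used in the question, the "climb until a parent has a next sibling" step can be sketched with DOMDocument and DOMXPath as below. This is not the answer's code: the helper name nextContentSibling is hypothetical, the input is a shortened copy of section 1, and the sketch only locates the sibling without rebuilding the section markup.

```php
<?php
// Hypothetical helper: climbs from $node until an ancestor with a following
// element sibling is found, and returns that sibling (or null at the top).
function nextContentSibling(DOMNode $node): ?DOMNode
{
    for ($current = $node->parentNode; $current instanceof DOMElement; $current = $current->parentNode) {
        // Skip whitespace-only text nodes that sit between elements
        for ($next = $current->nextSibling; $next !== null; $next = $next->nextSibling) {
            if ($next instanceof DOMElement) {
                return $next;
            }
        }
    }
    return null;
}

$html = '<body><div id="div1"><p><font><b>[[delimiter]]Start of content section 1.</b></font></p>'
      . '<p><span>More content in section 1</span></p></div></body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Text nodes whose content begins with the delimiter
foreach ($xp->query("//text()[starts-with(normalize-space(.), '[[delimiter]]')]") as $text) {
    $next = nextContentSibling($text);
    echo trim($text->nodeValue), PHP_EOL;
    echo $next ? $doc->saveHTML($next) : '(no more content)', PHP_EOL;
}
```

Here the climb passes b and font, which have no following sibling, and stops at the first p, whose next sibling is the paragraph holding "More content in section 1".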
Answer 1 (score: 3)
The closest I have got so far is...
$html = <<<XML
<body>
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
</p>
<span>More content in section 2</span>
<p>
<font>[[delimiter]]Start of section 3</font>
</p>
</div>
<div id="div3">
<span>
<font>More content in section 3</font>
</span>
</div>
</body>
XML;
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$div = $xp->query("body/node()[descendant::*[contains(text(),'[[delimiter]]')]]");
foreach ($div as $child) {
    echo "Div=".$doc->saveHTML($child).PHP_EOL;
}
// $child still holds the last match here; what follows it is the tail of the final section
echo "Last bit...".$doc->saveHTML($child).PHP_EOL;
$div = $xp->query("following-sibling::*", $child);
foreach ($div as $remain) {
    echo $doc->saveHTML($remain).PHP_EOL;
}
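I had to adjust the HTML to correct what I (hopefully rightly) took to be a mistakenly missing </div>.
It would be interesting to see how robust this is, but that is very difficult to test.
The 'Last bit' part takes the last matching element (div2 in this case) and outputs everything from there to the end of the document (using following-sibling::*).
Also note that it assumes the body tag is the base of the document, so it will need adjusting to suit your document. It may be as simple as changing the expression to //body...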
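One robustness point that can be checked in isolation: in XPath 1.0, contains(text(), ...) converts the text() node-set to a string using only its first node, so a delimiter that sits after an inline child element is missed. The sketch below uses an invented paragraph (not taken from the question) to compare that form with text()[contains(., ...)], which tests each text node separately.

```php
<?php
$doc = new DOMDocument();
// Assumption: a paragraph whose delimiter is not in its first text node
$doc->loadHTML('<body><div><p>Intro <b>bold</b> [[delimiter]]Start of section</p></div></body>');
$xp = new DOMXPath($doc);

// Only the first text node of <p> ("Intro ") is examined here
$firstOnly = $xp->query("//p[contains(text(), '[[delimiter]]')]");
// Every text node child of <p> is tested individually here
$anyText = $xp->query("//p[text()[contains(., '[[delimiter]]')]]");

echo "contains(text(), ...): ", $firstOnly->length, PHP_EOL;    // 0
echo "text()[contains(., ...)]: ", $anyText->length, PHP_EOL;   // 1
```

Whether Word-generated HTML actually produces such mixed content depends on the converter, so this is something to test against real documents rather than a known defect in the query above.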
Update: a version with more flexibility, which tries to cope with multiple sections inside the same top-level element...
$html = <<<XML
<html>
<body>
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div1a">
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
</p>
<span>More content in section 2</span>
<p>
<font>[[delimiter]]Start of section 3</font>
</p>
</div>
<div id="div3">
<span>
<font>More content in section 3</font>
</span>
</div>
</body>
</html>
XML;
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
// Top-level children of <body> that contain the delimiter somewhere below them
$div = $xp->query("//body/node()[descendant::*[contains(text(),'[[delimiter]]')]]");
$partCount = $div->length;
for ( $i = 0; $i < $partCount; $i++ ) {
    echo "Div $i...".$doc->saveHTML($div->item($i)).PHP_EOL;
    // Check for multiple sections in same element
    $count = $xp->evaluate("count(descendant::*[contains(text(),'[[delimiter]]')])",
        $div->item($i));
    if ( $count > 1 ) {
        echo PHP_EOL.PHP_EOL;
        // Only flags the extra sections: the containing element is printed once per delimiter, not split further
        for ($j = 0; $j< $count; $j++ ) {
            echo "Div $i.$j...".$doc->saveHTML($div->item($i)).PHP_EOL;
        }
    }
    // Separate variable, so the list of matched parts in $div is not overwritten mid-loop
    $siblings = $xp->query("following-sibling::*", $div->item($i));
    foreach ($siblings as $remain) {
        // Stop once the next part that contains a delimiter is reached
        if ( $i < $partCount-1 && $remain->isSameNode($div->item($i+1)) ) {
            break;
        }
        echo $doc->saveHTML($remain).PHP_EOL;
    }
    echo PHP_EOL.PHP_EOL;
}