Slicing HTML based on a delimiter

Time: 2017-08-22 15:47:07

Tags: php html parsing dom dom-manipulation

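I am converting Word documents to HTML on the fly and need to parse the resulting HTML based on a delimiter. For example, a document with three [[delimiter]] markers should be parsed into the following sections.

Section 1:

<div id="div1">
    <p>
        <font>
            <b>[[delimiter]]Start of content section 1.</b>
        </font>
    </p>
    <p>
        <span>More content in section 1</span>
    </p>
</div>

Section 2:

<div id="div2">
    <p>
        <b>
            <font>[[delimiter]]Start of section 2</font>
        </b>
    <p>
    <span>More content in section 2</span>
    <p></p>
<div>

Section 3:

<div id="div2">
    <p>
        <b>

        </b>
    <p>
    <p><font>[[delimiter]]Start of section 3</font></p>
<div>
<div id="div3">
    <span><font>More content in section 3</font></span>
</div>

  1. I can't simply "explode"/slice on the delimiter, because that would break the HTML. Every piece of text content has many parent elements.

  2. I have no control over the HTML structure, and it sometimes changes depending on the structure of the Word document. End users import the Word documents to be parsed in the application, so the generated HTML is not modified before parsing.

  3. The content is often at different depths in the HTML.

  4. I can't rely on element classes or IDs, because they are not consistent from document to document. #div1, #div2 and #div3 are only there for my example.

  5. My goal is to parse out the content, so it is fine if empty elements are left behind; I can simply run over the markup again and remove empty tags (p, font, b, etc.).

My attempt:

I am using the PHP DOM extension to parse the HTML and loop through the nodes, but I can't come up with a good algorithm to figure this out.

$doc = new \DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);

foreach ($body->childNodes as $child) {
    if ($child->hasChildNodes()) {
        // Do recursive call...
    } else {
        // Contains slide identifier?
    }
}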

2 answers:

Answer 0 (score: 7)

To solve a problem like this, you first need to work out the steps the solution requires before you start coding.

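  1. Find the element that starts with [[delimiter]].
  2. Check whether its parent has a next sibling.
  3. No? Move up to the parent and repeat step 2.
  4. Yes? The next sibling contains the content.

Now, once you have this working, you are 90% of the way there. All you need to do is clean up the unnecessary tags and you are done.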
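
As a rough illustration, the steps above can be sketched with PHP's built-in DOMDocument and DOMXPath. The helper name findFollowingContent is hypothetical, the markup is a shortened version of the question's first section, and skipping whitespace-only text nodes is an assumption about what should count as a sibling.

```php
<?php
// Hypothetical helper: climbs from $start towards the root and returns the
// first non-whitespace next sibling it meets, or null if there is none.
function findFollowingContent(DOMNode $start): ?DOMNode
{
    for ($node = $start; $node !== null; $node = $node->parentNode) {
        $next = $node->nextSibling;
        // Assumption: whitespace-only text nodes between tags are not content.
        while ($next !== null
            && $next->nodeType === XML_TEXT_NODE
            && trim($next->nodeValue) === '') {
            $next = $next->nextSibling;
        }
        if ($next !== null) {
            return $next;
        }
    }
    return null;
}

$html = '<body><div id="div1">'
      . '<p><font><b>[[delimiter]]Start of content section 1.</b></font></p>'
      . '<p><span>More content in section 1</span></p>'
      . '</div></body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Step 1: text nodes that start with the delimiter.
$starts = $xp->query('//text()[starts-with(normalize-space(.), "[[delimiter]]")]');

// Steps 2 to 4: climb until a next sibling turns up and record its text.
$found = [];
foreach ($starts as $textNode) {
    $next = findFollowingContent($textNode);
    $found[] = $next === null ? null : trim($next->textContent);
}
print_r($found);
```

For this sample the climb stops at the first `<p>`, whose next sibling is the second `<p>`. Cleaning up wrapper tags is left out of the sketch.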

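To get something you can extend, don't build one tangle of code that merely works; split the data you need into pieces you can work with.

The code below uses two classes that do exactly what you need and give you a convenient way to go through all the elements once you need them. It uses PHP Simple HTML DOM Parser instead of DOMDocument, because I like it a little better.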

    <?php
    error_reporting(E_ALL);
    require_once("simple_html_dom.php");
    
    $html = <<<XML
    <body>
            <div id="div1">
                    <p>
                            <font>
                                    <b>[[delimiter]]Start of content section 1.</b>
                            </font>
                    </p>
                    <p>
                            <span>More content in section 1</span>
                    </p>
            </div>
            <div id="div2">
                    <p>
                            <b>
                                    <font>[[delimiter]]Start of section 2</font>
                            </b>
                    </p>
                    <span>More content in section 2</span>
                    <p>
                            <font>[[delimiter]]Start of section 3</font>
                    </p>
            </div>
            <div id="div3">
                    <span>
                            <font>More content in section 3</font>
                    </span>
            </div>
    </body>
    XML;
    
    
    
    /*
     * CALL
     */
    
    $parser = new HtmlParser($html, '[[delimiter]]');
    
    //dump found
    //decode/encode to only show public values
    print_r(json_decode(json_encode($parser)));
    
    
    /*
     * ACTUAL CODE
     */
    
    
    class HtmlParser
    {
        private $_html;
        private $_delimiter;
        private $_dom;
    
        public $Elements = array();
    
        final public function __construct($html, $delimiter)
        {
            $this->_html = $html;
            $this->_delimiter = $delimiter;
            $this->_dom = str_get_html($this->_html);
    
            $this->getElements();
        }
    
        private function getElements()
        {
            //this will find all elements, including parent elements
            //it will also select the actual text as an element, without surrounding tags
            $elements = $this->_dom->find("[contains(text(),'".$this->_delimiter."')]");
    
            //find the actual elements that start with the delimiter
            foreach($elements as $element) {
                //we want the element without tags, so we search for outertext
                if (strpos($element->outertext, $this->_delimiter)===0) {
                    $this->Elements[] = new DelimiterTag($element);
                }
            }
    
        }
    
    }
    
    class DelimiterTag
    {
        private $_element;
    
        public $Content;
        public $MoreContent;
    
        final public function __construct($element)
        {
            $this->_element = $element;
            $this->Content = $element->outertext;
    
    
            $this->findMore();
        }
    
        private function findMore()
        {
            //we need to traverse up until we find a parent that has a next sibling
            //we need to keep track of the child, to cleanup the last parent
            $child = $this->_element;
            $parent = $child->parent();
            $next = null;
            while($parent) {
                $next = $parent->next_sibling();
    
                if ($next) {
                    break;
                }
                $child = $parent;
                $parent = $child->parent();
            }
    
            if (!$next) {
                //no more content
                return;
            }
    
            //create empty element, to build the new data
            //go up one more element and clean the innertext
            $more = $parent->parent();
            $more->innertext = "";
    
            //add the parent, because this is where the actual content lies
            //but we only want to add the child to the parent, in case there are more delimiters
            $parent->innertext = $child->outertext;
            $more->innertext .= $parent->outertext;
    
            //add the next sibling, because this is where more content lies
            $more->innertext .= $next->outertext;
    
            //set the variables
            if ($more->tag=="body") {
                //Your section 3 works slightly different as it doesn't show the parent tag, where the first two do.
                //That's why i show the innertext for the root tag and the outer text for others.
                $this->MoreContent = $more->innertext;
            } else {
                $this->MoreContent = $more->outertext;
            }
    
        }
    }
    
    
    
    
    ?>
    

Cleaned-up output:

    stdClass Object
    (
      [Elements] => Array
      (
        [0] => stdClass Object
        (
            [Content] => [[delimiter]]Start of content section 1.
            [MoreContent] => <div id="div1">
                                <p><font><b>[[delimiter]]Start of content section 1.</b></font></p>
                                <p><span>More content in section 1</span></p>
                              </div>
        )
    
        [1] => stdClass Object
        (
            [Content] => [[delimiter]]Start of section 2
            [MoreContent] => <div id="div2">
                                <p><b><font>[[delimiter]]Start of section 2</font></b></p>
                                <span>More content in section 2</span>
                             </div>
        )
    
        [2] => stdClass Object
        (
            [Content] => [[delimiter]]Start of section 3
            [MoreContent] => <div id="div2">
                                <p><font>[[delimiter]]Start of section 3</font></p>
                             </div>
                             <div id="div3">
                                <span><font>More content in section 3</font></span>
                              </div>
        )
      )
    )
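
The question mentions stripping empty tags in a second pass. One possible way to do that with the built-in DOM extension is sketched below. removeEmptyElements is a hypothetical helper, and keeping br, img and hr is an assumption that neither poster specified.

```php
<?php
// Hypothetical helper: repeatedly removes leaf elements that hold no text,
// so that parents emptied by one pass are removed by the next.
function removeEmptyElements(DOMDocument $doc): void
{
    $xp = new DOMXPath($doc);
    // Assumption: br, img and hr are meant to be empty and are kept.
    $query = '//body//*[not(*) and normalize-space(.) = ""'
           . ' and not(self::br or self::img or self::hr)]';
    do {
        // Copy the result to an array before modifying the tree.
        $empty = iterator_to_array($xp->query($query));
        foreach ($empty as $el) {
            $el->parentNode->removeChild($el);
        }
    } while (count($empty) > 0);
}

$html = '<body><div id="div2"><p><b></b></p>'
      . '<p><font>Start of section 3</font></p></div></body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
removeEmptyElements($doc);

$paragraphs = $doc->getElementsByTagName('p')->length;
$bold = $doc->getElementsByTagName('b')->length;
echo "p: $paragraphs, b: $bold", PHP_EOL;
```

The first pass removes the empty `<b>`, which leaves its `<p>` empty, so the second pass removes that as well.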
    

Answer 1 (score: 3)

The closest I have got so far is...

$html = <<<XML
<body>
    <div id="div1">
        <p>
            <font>
                <b>[[delimiter]]Start of content section 1.</b>
            </font>
        </p>
        <p>
            <span>More content in section 1</span>
        </p>
    </div>
    <div id="div2">
        <p>
            <b>
                <font>[[delimiter]]Start of section 2</font>
            </b>
        </p>
        <span>More content in section 2</span>
        <p>
            <font>[[delimiter]]Start of section 3</font>
        </p>
    </div>
    <div id="div3">
        <span>
            <font>More content in section 3</font>
        </span>
    </div>
</body>
XML;
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$div = $xp->query("body/node()[descendant::*[contains(text(),'[[delimiter]]')]]");

foreach ($div as $child) {
    echo "Div=".$doc->saveHTML($child).PHP_EOL;
}

echo "Last bit...".$doc->saveHTML($child).PHP_EOL;
$div = $xp->query("following-sibling::*", $child);
foreach ($div as $remain) {
    echo $doc->saveHTML($remain).PHP_EOL;
}

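I assume I had to adjust the HTML to correct the (hopefully) erroneous omission of the closing </div>.

It would be interesting to see how robust this is, but it is difficult to test.

The 'Last bit' part tries to take the element with the last marker in it (div2 in this case) through to the end of the document (using following-sibling::*).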

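Also note that it assumes the body tag is the base of the document, so it will need adjusting to fit your document. It may be as simple as changing it to //body...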
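
To see what the following-sibling::* step selects, here is a small stand-alone sketch. It passes the context node explicitly and uses an absolute //body path. The id attributes are used only to make the result easy to read, since real documents cannot rely on them.

```php
<?php
$html = '<body><div id="div1">one</div><div id="div2">two</div>'
      . '<div id="div3">three</div></body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Absolute path, so it does not matter how deep body sits in the document.
$start = $xp->query('//body/div[@id="div2"]')->item(0);

// Every element after $start at the same level, in document order.
$ids = [];
foreach ($xp->query('following-sibling::*', $start) as $sibling) {
    $ids[] = $sibling->getAttribute('id');
}
print_r($ids);
```

Only div3 is collected here, because following-sibling never looks at earlier siblings or at descendants.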

Update: more flexibility, with the ability to cope with multiple sections in the same overall segment...

$html = <<<XML
    <html>
    <body>
        <div id="div1">
            <p>
                <font>
                    <b>[[delimiter]]Start of content section 1.</b>
                </font>
            </p>
            <p>
                <span>More content in section 1</span>
            </p>
        </div>
        <div id="div1a">
            <p>
                <span>More content in section 1</span>
            </p>
        </div>
        <div id="div2">
            <p>
                <b>
                    <font>[[delimiter]]Start of section 2</font>
                </b>
            </p>
            <span>More content in section 2</span>
            <p>
                <font>[[delimiter]]Start of section 3</font>
            </p>
        </div>
        <div id="div3">
            <span>
                <font>More content in section 3</font>
            </span>
        </div>
    </body>
    </html>
XML;

$doc = new \DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$div = $xp->query("//body/node()[descendant::*[contains(text(),'[[delimiter]]')]]");

$partCount = $div->length;
for ( $i = 0; $i < $partCount; $i++ )  {
    echo "Div $i...".$doc->saveHTML($div->item($i)).PHP_EOL;

    // Check for multiple sections in same element
    $count = $xp->evaluate("count(descendant::*[contains(text(),'[[delimiter]]')])",
            $div->item($i));
    if ( $count > 1 )   {
        echo PHP_EOL.PHP_EOL;
        for ($j = 0; $j< $count; $j++ ) {
            echo "Div $i.$j...".$doc->saveHTML($div->item($i)).PHP_EOL;
        }
    }
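    // Use a separate variable so the outer $div node list is not overwritten
    $siblings = $xp->query("following-sibling::*", $div->item($i));
    foreach ($siblings as $remain) {
        // Stop once the next delimiter-bearing element is reached
        if ( $i < $partCount-1 && $remain->isSameNode($div->item($i+1))  )   {
            break;
        }
        echo $doc->saveHTML($remain).PHP_EOL;
    }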

    echo PHP_EOL.PHP_EOL;
}
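
If only the text of each section is needed rather than the surrounding markup (an assumption, and a different technique from the queries above), the text nodes can be walked in document order and grouped by delimiter. A minimal sketch, with $sections as an illustrative name:

```php
<?php
$delimiter = '[[delimiter]]';
$html = <<<'HTML'
<body>
  <div id="div1">
    <p><font><b>[[delimiter]]Start of content section 1.</b></font></p>
    <p><span>More content in section 1</span></p>
  </div>
  <div id="div2">
    <p><b><font>[[delimiter]]Start of section 2</font></b></p>
    <span>More content in section 2</span>
    <p><font>[[delimiter]]Start of section 3</font></p>
  </div>
  <div id="div3"><span><font>More content in section 3</font></span></div>
</body>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$sections = [];
// Non-blank text nodes under body, in document order.
foreach ($xp->query('//body//text()[normalize-space(.) != ""]') as $text) {
    $value = trim($text->nodeValue);
    if (strpos($value, $delimiter) === 0) {
        // A delimiter opens a new section; the marker itself is dropped.
        $sections[] = [substr($value, strlen($delimiter))];
    } elseif ($sections) {
        // Anything else belongs to the most recently opened section.
        $sections[count($sections) - 1][] = $value;
    }
}
print_r($sections);
```

This ignores nesting depth entirely, which sidesteps the varying structure the question describes, at the cost of losing all formatting tags.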