Question

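As a search result I get the content surrounding the search term. But this is only a fragment of the whole page, and it contains only the tags near the search term. If the matching (opening or closing) tag lies further away, I end up with unbalanced HTML tags. These unbalanced tags can break the page layout, because the browser tries to balance them against tags from entirely different levels of the page.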

Example

This might be the whole page:

<li>
  <h3>Ang my oniuse.</h3> 
  <p>Oh! any or said faing ear Dand and tion on so wor st wouter and abox 
  a makess stand he he sne at mon the nany ing a me come hink floney a 
  naiday. Smiler yousee lurneremiley boll his a grog.</p>
</li>
<li>
  <h3>I'l hat seelectler</h3> 
  <p> Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his
  thend that ance, he ned and me lood says wou hed set pidays far it
  conted, and seell yarty.</p>
</li>

Searching for seelectler might produce an HTML fragment like this:

  naiday. Smiler yousee lurneremiley boll his a grog.</p>
</li>
<li>
  <h3>I'l hat <b>seelectler</b></h3> 
  <p> Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his

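Now the p tag and the li tag are unbalanced. Given the stray closing tags, the browser tries to close a p tag (possibly the one wrapping the whole found text) and an li tag (possibly the one wrapping each result entry). But the next opening tags of these kinds carry the wrong CSS class, some div tags between li and p are now unmatched, and the final closing tag may end up closing a div of the column layout.

Result: the whole page layout is broken.

The desired result could be the following (every unpaired tag gets a partner; this is not foolproof):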

<li><p>
  naiday. Smiler yousee lurneremiley boll his a grog.</p>
</li>
<li>
  <h3>I'l hat <b>seelectler</b></h3> 
  <p> Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his
</p></li>

or:

  naiday. Smiler yousee lurneremiley boll his a grog.
  <h3>I'l hat <b>seelectler</b></h3> 
  Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his

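But this solution may lose important layout, such as line breaks.

Is there a ViewHelper that can clean up unbalanced HTML tags, either by adding the missing parts or by removing the superfluous ones?
Is there an algorithm or regular expression for detecting unbalanced tags?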

Answer 1

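I would suggest removing all HTML tags from the search results and displaying them as plain text.

Some minor "formatting" could be recreated by replacing certain tags with line breaks.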
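
A minimal sketch of that idea in plain PHP, assuming the fragment is UTF-8. The function name `fragmentToPlainText` and the choice of tags that become line breaks are illustrative assumptions, not part of TYPO3:

```php
<?php
// Hypothetical helper: turn an HTML fragment into plain text while keeping
// a line break wherever a block-level tag (or <br>) used to be.
function fragmentToPlainText(string $html): string
{
    // 1. Collapse source whitespace (including newlines) to single spaces.
    $text = preg_replace('~\s+~u', ' ', $html);

    // 2. Replace <br> and opening/closing tags of some block elements with "\n".
    //    The tag list is an assumption; extend it as needed.
    $text = preg_replace('~<br\s*/?>|</?(?:p|li|h[1-6]|div|tr)\b[^>]*>~i', "\n", $text);

    // 3. Drop every remaining tag and decode entities such as &auml;.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 4. Trim each line and discard the empty ones.
    $lines = array_filter(
        array_map('trim', explode("\n", $text)),
        static fn (string $line): bool => $line !== ''
    );

    return implode("\n", $lines);
}
```

Because no tags survive, an unbalanced fragment can no longer disturb the surrounding layout. If the highlighting should be kept, `strip_tags()` accepts a list of allowed tags (for example `'<b>'`) as its second argument.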

Answer 2

The closest solution I have found is this ViewHelper:

<?php
namespace MyCompany\MyExtension\ViewHelpers;

use TYPO3\CMS\Fluid\Core\ViewHelper\AbstractViewHelper;

/**
 * fills in missing xml tags
 */
class BalanceXmlViewHelper extends AbstractViewHelper
{

    /**
     * balances XML-fragment with additional tags
     *
     * @param string $xmlIn
     * @return string
     */
    public function render($xmlIn = null)
    {
        if (null === $xmlIn) {
            $xmlIn = $this->renderChildren();
        }

        $xmlDoc = new \DOMDocument();
        // The input is UTF-8; the prepended XML declaration tells libxml so.
        $xmlDoc->loadHTML(
            '<?xml encoding="UTF-8">' . $xmlIn,
            // We do not want a complete HTML document, so suppress the implied
            // html/body tags and the doctype, and silence parser messages.
            LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NOXMLDECL
        );

        // Strip the 23-character encoding prefix again and decode entities
        // (e.g. German umlauts) back to plain characters.
        $retVal = html_entity_decode(
            mb_substr($xmlDoc->saveHTML(), 23),
            ENT_COMPAT | ENT_HTML401
        );

        return $retVal;
    }
}

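I know it can leave invalid markup behind (for example an LI tag without a UL), but it is more precise than removing all tags (stripHTML()), which results in text without line breaks, or even without spaces where block tags were removed.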
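
For the detection half of the question, a simple stack-based scan is one possible approach. The following is only a sketch and is not taken from the ViewHelper above: the function names `findUnbalancedTags` and `balanceFragment` are hypothetical, the regular expression ignores comments, scripts and tags cut off in the middle, and HTML's optional end tags (such as an omitted `</li>`) are treated as errors.

```php
<?php
// Hypothetical helper: report closing tags without an opener ("unopened")
// and opening tags that are never closed ("unclosed").
function findUnbalancedTags(string $html): array
{
    // Void elements never have a closing tag, so they are skipped.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'];
    $open = [];      // stack of currently open tag names
    $unopened = [];  // closing tags seen without a matching opener

    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>~', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as [, $slash, $name, $selfClosing]) {
        $name = strtolower($name);
        if ($selfClosing === '/' || in_array($name, $void, true)) {
            continue;
        }
        if ($slash === '') {
            $open[] = $name;               // opening tag: push
        } elseif (end($open) === $name) {
            array_pop($open);              // matching closing tag: pop
        } else {
            $unopened[] = $name;           // closing tag with no opener
        }
    }

    return ['unclosed' => $open, 'unopened' => $unopened];
}

// Hypothetical helper: add the missing openers in front and the missing
// closers behind. Added opening tags carry no attributes or CSS classes.
function balanceFragment(string $html): string
{
    $found = findUnbalancedTags($html);
    $prefix = '';
    foreach (array_reverse($found['unopened']) as $name) {
        $prefix .= "<$name>";
    }
    $suffix = '';
    foreach (array_reverse($found['unclosed']) as $name) {
        $suffix .= "</$name>";
    }

    return $prefix . $html . $suffix;
}
```

Applied to a fragment shaped like the one in the question, this prepends `<li><p>` and appends `</p></li>`, which corresponds to the first desired result shown there.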

How can I clean up a fragment of HTML source?
