Question

I have a CKEditor that outputs some tags around images. So far I have been using a regex to get rid of those wrapping tags.

Here are some test strings:

$example1 = '<p data-entity-type="" data-entity-uuid="" style="text-align: center;"><span><img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620" /><span title="Click and drag to resize">•</span></span></p>';
$example2 = '<p><img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620" /></p>';
$example3 = '<html>
<head></head>
<body>
some text here...
<p><img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620" />
</p>
</body>
</html>';
// Wanted result : <html><head></head><body>some text here...<img alt="julie-bishop.jpg" data-entity-type="" data-entity-uuid="" height="349" src="/sites/default/files/inline-images/julie-bishop.jpg" width="620" /></body></html>

The regex I tried is /(.*?)<p>\s*(<img[^<]+?)\s*<\/p>(.*)/, which works perfectly with example2.

preg_replace("/(.*?)<p>\s*(<img[^<]+?)\s*<\/p>(.*)/", "$1$2$3", $string);

The rule is: if you detect a <p> with an <img> as one of its children, keep the <img> and remove the <p> and its other children (which can be a <span> or anything else...).

Any idea how to achieve this?

Answer 1

You can use the following regex:

<p(?:[^>]*|\r\n|\n)>(?:.*|\r\n|\n)(<img(?:[^>]*|\r\n|\n)>)(?:.*|\r\n|\n)<\/p>

Here is a demo on regex101.com.

Here is a working demo on eval.in (using your PHP code).
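Since the demo links did not survive, here is a minimal sketch of how the pattern above could be plugged into preg_replace. The "/" delimiters and the variable names $pattern, $result1 and $result2 are my own choices, not part of the answer.

```php
<?php
// Answer 1's pattern wrapped in "/" delimiters; the "/" in "</p>" is already escaped.
$pattern = '/<p(?:[^>]*|\r\n|\n)>(?:.*|\r\n|\n)(<img(?:[^>]*|\r\n|\n)>)(?:.*|\r\n|\n)<\/p>/';

$example1 = '<p data-entity-type="" data-entity-uuid="" style="text-align: center;"><span><img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620" /><span title="Click and drag to resize">•</span></span></p>';
$example2 = '<p><img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620" /></p>';

// Keep only capture group 1 (the <img> tag) and drop the rest of the match.
$result1 = preg_replace($pattern, '$1', $example1);
$result2 = preg_replace($pattern, '$1', $example2);

echo $result1, PHP_EOL;
echo $result2, PHP_EOL;
```

Both calls should leave just the <img ... /> tag. One caveat: the ".*" parts are greedy, so two wrapped images on the same line would be matched as a single block and only the last <img> would be kept.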

Answer 2

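The approach you are using is not a good one: instead of regex you should use DOMDocument. Here we use DOMDocument together with DOMXPath. I hope this solution helps you solve the problem.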

<?php
ini_set('display_errors', 1);
$example1 = '<p data-entity-type="" data-entity-uuid="" style="text-align: center;"><span><img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620" /><span title="Click and drag to resize">•</span></span></p>';
$example2 = '<p><img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620" /></p>';
$example3 = '<html>
<head></head>
<body>
some text here...
<p><img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620" />
</p>
</body>
</html>';


$domDocument= new DOMDocument();
$domDocument->loadHTML($example1,LIBXML_HTML_NOIMPLIED);
$domXPath=new DOMXPath($domDocument);

if($domXPath->query("//html")->length)
{
    foreach($domXPath->query("//p") as $pelement)
    {
        if($domXPath->query(".//img",$pelement)->length)
        {
            $pelement->parentNode->replaceChild(getReplacement($domXPath),$pelement);
        }
    }
    echo $pelement->ownerDocument->saveHTML();
}
else
{
    echo getReplacement($domXPath,true);
}

function getReplacement($domXPath,$string=false)
{
    global $domDocument;
    $results=$domXPath->query('//p');
    foreach($results as $result)
    {
        if($innerNodes=$domXPath->query("//img",$result->childNodes->item(0)))
        {
            if($string===true)
            {
                return $domDocument->saveHTML($result->childNodes->item(0));
            }
            else 
            {
                return $result->childNodes->item(0);
            }
        }
    }
}

Output for string1:

<img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620">•

Output for string2:

<img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620">

Output for string3:

<html> <head></head> <body> some text here... <img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620"> </body> </html>
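For comparison, here is a shorter sketch of the same DOM idea that follows the rule from the question more literally: every <p> with an <img> somewhere inside it is replaced by just its <img> node(s). The helper name unwrapImages is hypothetical, and the XPath expressions are relative (".//img") so that each <p> is tested on its own descendants rather than on the whole document. This is one possible way to write it, not the code from the answer.

```php
<?php
/**
 * Hypothetical helper: replace each <p> that contains an <img>
 * with the <img> node(s) found inside it.
 */
function unwrapImages(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);

    // The predicate [.//img] keeps only <p> elements with an <img> descendant.
    foreach ($xpath->query('//p[.//img]') as $p) {
        // ".//img" is relative to $p; "//img" would search the entire document.
        foreach ($xpath->query('.//img', $p) as $img) {
            // insertBefore() moves the <img> out of the <p>, preserving order.
            $p->parentNode->insertBefore($img, $p);
        }
        // Remove the <p> along with its remaining children (spans, text, ...).
        $p->parentNode->removeChild($p);
    }
}

$example1 = '<p data-entity-type="" data-entity-uuid="" style="text-align: center;"><span><img alt="image.jpg" data-entity-type="" data-entity-uuid="" height="349" src="image.jpg" width="620" /><span title="Click and drag to resize">•</span></span></p>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($example1);          // without flags, libxml adds the <html><body> wrappers
unwrapImages($doc);

// Serialize only the surviving <img>; use $doc->saveHTML() for a full document.
echo $doc->saveHTML($doc->getElementsByTagName('img')->item(0)), PHP_EOL;
```

Note that loadHTML assumes ISO-8859-1 when the markup declares no charset, so non-ASCII text that stays in the document may need an explicit encoding declaration.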

Answer 3

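Since this only involves TAGS, specifically a <p> directly adjacent to an <img../>,
it can easily be done with a regex.

The catch is that, unless the regex also matches every other tag, it cannot match and skip things in the right order.

The reason every tag has to be matched is that tags can also be hidden inside invisible content and comments.

However, PHP gives you the power of the (*SKIP)(*FAIL) backtracking control verbs. They let the regex match, and more importantly skip past, the other tags and the invisible content, so that none of it is ever reported as a match.

And, when combined with a few atomic groups, it is fast.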
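To make the (*SKIP)(*FAIL) idea concrete, here is a deliberately simplified sketch. It is not the full pattern from this answer: it only skips HTML comments and only unwraps a bare <p><img ...></p>. The sample string and variable names are made up for illustration.

```php
<?php
// Branch 1: match an HTML comment, then (*SKIP)(*FAIL) throws that match away
//           and resumes scanning after the comment, so nothing inside it can match.
// Branch 2: match a bare <p><img ...></p> and capture the <img> tag in group 1.
$pattern = '~<!--[\s\S]*?-->(*SKIP)(*FAIL)|<p\s*>\s*(<img\s[^>]*>)\s*</p\s*>~';

$html = '<!-- <p><img src="old.jpg"></p> --><p><img src="new.jpg" /></p>';

// Only branch 2 can produce a match, so '$1' is always the captured <img>.
$result = preg_replace($pattern, '$1', $html);

echo $result, PHP_EOL;
// prints: <!-- <p><img src="old.jpg"></p> --><img src="new.jpg" />
```

The commented-out paragraph is left alone because the comment is consumed as a whole before the second branch ever gets a chance to look inside it.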

The benchmark below ran 50 iterations over a 130 KB HTML source (6.5 MB of HTML in total) in about two-thirds of a second.

Regex1:   (?><p\s*>\s*(<img\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?>)\s*</p\s*>)|(?><(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\2\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>)(*SKIP)(*FAIL)
Completed iterations:   50  /  50     ( x 1 )
Matches found per iteration:   2
Elapsed Time:    0.68 s,   683.32 ms,   683318 µs

https://regex101.com/r/CCyNZ5/1

Find (stringed):

Replace: $1
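As a sketch of how Regex1 could be used from PHP without escaping its quotes by hand, the pattern can be put in a nowdoc and wrapped in "~" delimiters (a character that does not occur in the pattern). The delimiter choice, the variable names and the sample HTML are my own assumptions.

```php
<?php
// Regex1 copied verbatim into a nowdoc, so neither the quotes nor the
// backslashes need any extra escaping.
$regex = <<<'REGEX'
(?><p\s*>\s*(<img\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?>)\s*</p\s*>)|(?><(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\2\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>)(*SKIP)(*FAIL)
REGEX;

// The look-alike markup inside <script> is invisible content and must be skipped.
$html = '<script>var s = "<p><img src=x.jpg></p>";</script><p><img src="a.jpg"></p>';

// Replace each match with group 1, i.e. the <img> tag on its own.
$result = preg_replace('~' . $regex . '~', '$1', $html);

echo $result, PHP_EOL;
// prints: <script>var s = "<p><img src=x.jpg></p>";</script><img src="a.jpg">
```

Keep in mind that the first branch only accepts a <p> with no attributes, so a wrapper such as <p style="..."> is treated as an ordinary tag and skipped rather than removed.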

Formatted

     # Just '<p><img../></p>'
     (?>
          <p \s* >
          \s* 
          (                             # (1 start)
               <img
               \s+ 
               (?:
                    " [\S\s]*? " 
                 |  ' [\S\s]*? ' 
                 |  [^>]? 
               )+
               \s* /?
               >
          )                             # (1 end)
          \s* 
          </p \s* >
     )

  |  # Or,

     # Skip all other tags and invisible content
     (?>
          <
          (?:
               (?:
                    (?:
                                                       # Invisible content; end tag req'd
                         (                             # (2 start)
                              script
                           |  style
                           |  object
                           |  embed
                           |  applet
                           |  noframes
                           |  noscript
                           |  noembed 
                         )                             # (2 end)
                         (?:
                              \s+ 
                              (?>
                                   " [\S\s]*? "
                                |  ' [\S\s]*? '
                                |  (?:
                                        (?! /> )
                                        [^>] 
                                   )?
                              )+
                         )?
                         \s* >
                    )

                    [\S\s]*? </ \2 \s* 
                    (?= > )
               )

            |  (?: /? [\w:]+ \s* /? )
            |  (?:
                    [\w:]+ 
                    \s+ 
                    (?:
                         " [\S\s]*? " 
                      |  ' [\S\s]*? ' 
                      |  [^>]? 
                    )+
                    \s* /?
               )
            |  \? [\S\s]*? \?
            |  (?:
                    !
                    (?:
                         (?: DOCTYPE [\S\s]*? )
                      |  (?: \[CDATA\[ [\S\s]*? \]\] )
                      |  (?: -- [\S\s]*? -- )
                      |  (?: ATTLIST [\S\s]*? )
                      |  (?: ENTITY [\S\s]*? )
                      |  (?: ELEMENT [\S\s]*? )
                    )
               )
          )
          >
     )
     (*SKIP)(*FAIL)

Remove the wrapper tags around an image