Question

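I have the following string stored in a database table that contains HTML, and I need to strip the HTML before it is rendered on a web page (this is old content that I have no control over).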

<p>I am <30 years old and weight <12st</p>

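After running strip_tags on it, only "I am" is displayed.

I understand why strip_tags does this, so I need to replace the 2 instances of < with &lt;

I found a regex that converts the first instance but not the second, and I can't work out how to amend it so that it replaces all instances.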

/<([^>]*)(<|$)/

which produces: I am currently <30 years old and less than

I have a demo here: https://eval.in/1117956

Answer 1

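Trying to parse HTML content with string functions (regex functions included) is a bad idea (plenty of threads on SO explain why, search for them). HTML is too complex for that.

The problem is that you have no control over the malformed HTML. There are two possible attitudes:

Do nothing: the data is corrupted, so the information is lost for good, and you can't retrieve something that has disappeared. That is a perfectly acceptable point of view. You may be able to find another source for the same data somewhere, or you can choose to print the malformed HTML as it is.
Try to repair it. In that case you have to be sure that all the problems in the documents are limited in scope and can be solved (at least by hand).

Instead of a direct string approach, you can use the PHP libxml implementation via DOMDocument. Even if the libxml parser doesn't give a better result than strip_tags, it does report errors, and you can use them to identify the kind of error and to locate the problematic positions in the HTML string.

With your string, the libxml parser returns the recoverable error XML_ERR_NAME_REQUIRED (code 68) for each problematic opening angle bracket. The errors can be inspected with libxml_get_errors().

Example with your string: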
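
To look at these errors directly, a minimal sketch along the following lines should be enough, assuming the DOM extension is available. The printf layout is only illustrative, and the exact codes, columns, and messages may differ depending on the libxml version PHP is built against.

```php
<?php
// Sketch: list the libxml errors reported while parsing the sample string.
$s = '<p>I am <30 years old and weight <12st</p>';

// Collect errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

// Each entry is a LibXMLError object with code, line, column and message.
$errors = libxml_get_errors();
foreach ($errors as $e) {
    printf("code=%d line=%d column=%d %s\n", $e->code, $e->line, $e->column, trim($e->message));
}

// Restore the previous error handling state.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```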

$s = '<p>I am <30 years old and weight <12st</p>';

$libxmlErrorState = libxml_use_internal_errors(true);

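// Returns the position of the last libxml error with the given code,
// or false when no such error was reported.
function getLastErrorPos($code) {
    $errors = array_filter(libxml_get_errors(), function ($e) use ($code) {
        return $e->code === $code;
    });

    if ( !$errors )
        return false;

    $lastError = array_pop($errors);
    // number of lines before the error line, and number of characters
    // before the offending bracket on that line
    return ['line' => $lastError->line - 1, 'column' => $lastError->column - 2 ];
}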

define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name

$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';

$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

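while ( false !== $position = getLastErrorPos(XML_ERR_NAME_REQUIRED) ) {
    libxml_clear_errors();
    $pattern = vsprintf($patternTemplate, $position);

    // escape the bracket at the reported position; stop if nothing was
    // replaced, otherwise the same error would be reported forever
    $s = preg_replace($pattern, '&lt;', $s, 1, $count);
    if ( $count === 0 )
        break;

    $dom = new DOMDocument;
    $dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
}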

echo $dom->saveHTML();

libxml_clear_errors();
libxml_use_internal_errors($libxmlErrorState);

demo

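$patternTemplate is a format string (see sprintf in the PHP manual) in which the two %d placeholders stand, respectively, for the number of lines to skip and the position from the start of the line (here 0 and 8).

Pattern details: the goal of the pattern is to reach the position of the angle bracket from the start of the string.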

~ # my favorite pattern delimiter
  (?:
      .* # all characters until the end of the line
      \R # the newline sequence
  ){0} # reach the desired line

  .{8} # reach the desired column
  \K   # remove all on the left from the match result
  <    # the match result is only this character
~A # anchor the pattern at the start of the string
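
As a quick check of the pattern on its own, the sketch below fills the template by hand with the two values for this string (0 lines, 8 characters) instead of taking them from libxml, and escapes only the first bracket. The variable names $pattern and $fixed are chosen just for this illustration.

```php
<?php
$s = '<p>I am <30 years old and weight <12st</p>';
$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';

// 0 full lines to skip, then 8 characters before the bracket
$pattern = vsprintf($patternTemplate, [0, 8]);
echo $pattern, "\n"; // ~(?:.*\R){0}.{8}\K<~A

// Because of \K only the bracket itself is part of the match and gets replaced.
$fixed = preg_replace($pattern, '&lt;', $s, 1);
echo $fixed, "\n"; // <p>I am &lt;30 years old and weight <12st</p>
```

The second bracket is left untouched here; in the loop above it is handled on a later pass, once the string has been parsed again.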

Another related question that uses a similar technique: parse invalid XML manually

Answer 2

Try:

$string = '<p>I am <30 years old and weight <12st</p>';
$html = preg_replace('/^\s*<[^>]+>\s*|\s*<\/[^>]+>\s*\z/', '', $string);// remove html tags
$final = preg_replace('/[^A-Za-z0-9 !@#$%^&*().]/u', '', $html); //remove special character

Live DEMO

Answer 3

Simply use str_replace():

Replace <p> and </p> with [p] and [/p]
Replace < with &lt;
Put the p tags back, i.e. replace [p] and [/p] with <p> and </p>

Code

<?php
$description = "<p>I am <30 years old and weight <12st</p>";

$d = str_replace(['[p]','[/p]'],['<p>','</p>'], 
            str_replace('<', '&lt;', 
                str_replace(['<p>','</p>'], ['[p]','[/p]'], 
                    $description)));

echo $d;
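
To tie this back to the question, here is a small sketch of what strip_tags does once the stray brackets have been escaped this way. The variable names $escaped and $text are only illustrative, and the sketch assumes, as in the sample, that <p> is the only real tag in the content.

```php
<?php
$description = "<p>I am <30 years old and weight <12st</p>";

// Same three replacements as above: protect the p tags, escape every
// remaining "<", then restore the p tags.
$escaped = str_replace(['[p]', '[/p]'], ['<p>', '</p>'],
    str_replace('<', '&lt;',
        str_replace(['<p>', '</p>'], ['[p]', '[/p]'], $description)));

// strip_tags now removes only the real tags and leaves the entities alone.
$text = strip_tags($escaped);
echo $text, "\n"; // I am &lt;30 years old and weight &lt;12st
```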

Answer 4

My guess is that we might want to design a good right boundary here to capture the < that is not part of a tag, maybe a simple expression similar to:

<(\s*[+-]?[0-9])

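might work, since we should normally have a number or a sign right after <. The [+-]?[0-9] part would likely have to change if other kinds of characters can follow the <.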

Demo

Test

$re = '/<(\s*[+-]?[0-9])/m';
$str = '<p>I am <30 years old and weight <12st I am <  30 years old and weight <  12st I am <30 years old and weight <  -12st I am <  +30 years old and weight <  12st</p>';
$subst = '&lt;$1';

$result = preg_replace($re, $subst, $str);

echo $result;
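
Applied to the exact string from the question, the same expression can be sketched as follows. The second call is a made-up input, added only to illustrate the limitation mentioned above: a bracket that is not followed by a digit or sign is left as it is.

```php
<?php
$re = '/<(\s*[+-]?[0-9])/';
$str = '<p>I am <30 years old and weight <12st</p>';

// The real tags are untouched because "<p" and "</" do not start with a digit or sign.
$escaped = preg_replace($re, '&lt;$1', $str);
echo $escaped, "\n"; // <p>I am &lt;30 years old and weight &lt;12st</p>

// Hypothetical input: "<b" is not matched, "<5" is.
$other = preg_replace($re, '&lt;$1', 'a <b and c <5');
echo $other, "\n"; // a <b and c &lt;5
```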

How do I replace multiple instances of less-than < in a PHP string that also uses strip_tags?
