Question

I want to remove any extra whitespace from my code; I'm parsing a docblock. The problem is that I don't want to remove the whitespace inside <code>code goes here</code>.

For example, I use this to remove the extra whitespace:

$string = preg_replace('/[ ]{2,}/', '', $string);

But I want to keep the whitespace inside <code></code>.

This code/string:

This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

should be transformed into:

This is some text
This is also some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

How can I do this?

Answer 1

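You're not really looking for a condition - you need a way to skip parts of the string so that they are not replaced. With preg_replace this is quite easy: insert dummy groups and replace each group with itself. In your case you need only one: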

$str = preg_replace("~(<code>.*?</code>)|^ +| +$|( ) +~smi" , "$1$2", $str);

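How does it work?

(<code>.*?</code>) - matches a <code> block and captures it into the first group, $1. This assumes simple formatting and no nesting, but it can be made more elaborate if needed.
^ + - matches and removes spaces at the start of a line.
 +$ - matches and removes spaces at the end of a line.
( ) + - matches two or more spaces in the middle of a line, and captures the first space into the second group, $2.

The replacement string $1$2 keeps <code> blocks and the first space (if captured), and removes everything else that was matched.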

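Things to keep in mind:

If $1 or $2 did not capture anything, it is replaced with an empty string.
Alternation (a|b|c) works from left to right - as soon as one branch matches, the engine is satisfied and does not try the remaining branches. That is why ^ +| +$ must come before ( ) +.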

Working example: http://ideone.com/HxbaV
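The explanation above can be tried out with a short, self-contained sketch. The pattern and replacement are the ones from this answer, written in single quotes so PHP does no string interpolation; the function name collapse_outside_code and the fallback on a PCRE error are assumptions added for illustration.

```php
<?php
// Hypothetical helper name; the pattern and replacement come from the answer.
function collapse_outside_code(string $str): string
{
    // $1 re-inserts whole <code> blocks untouched; $2 re-inserts a single
    // space for each run of spaces; leading/trailing spaces capture nothing
    // and are therefore removed.
    $result = preg_replace('~(<code>.*?</code>)|^ +| +$|( ) +~smi', '$1$2', $str);

    // preg_replace() returns null on a PCRE error; this sketch then simply
    // hands back the input unchanged (an assumption, not part of the answer).
    return $result === null ? $str : $result;
}

$input = "This  is some  text\n  This is also   some text\n\n"
       . "<code>\nUser::setup(array(\n    'key1' => 'value1'\n));\n</code>";

echo collapse_outside_code($input), "\n";
```

The two text lines lose their extra spaces, while the four-space indentation inside the <code> block is left as it was.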

Answer 2

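When parsing markup with PHP and regular expressions, the preg_replace_callback() function combined with the recursive expressions (?R), (?1), (?2)... is a very powerful tool indeed. The following script handles your test data quite nicely: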

<?php // test.php 20110312_2200

function clean_non_code(&$text) {
    $re = '%
    # Match and capture either CODE into $1 or non-CODE into $2.
      (                      # $1: CODE section (never empty).
        <code[^>]*>          # CODE opening tag
        (?R)+                # CODE contents w/nested CODE tags.
        </code\s*>           # CODE closing tag
      )                      # End $1: CODE section.
    |                        # Or...
      (                      # $2: Non-CODE section (may be empty).
        [^<]*+               # Zero or more non-< {normal*}
        (?:                  # Begin {(special normal*)*}
          (?!</?code\b)      # If not a code open or close tag,
          <                  # match non-code < {special}
          [^<]*+             # More {normal*}
        )*+                  # End {(special normal*)*}
      )                      # End $2: Non-CODE section
    %ix';

    $text = preg_replace_callback($re, '_my_callback', $text);
    if ($text === null) exit("PREG Error!\nTarget string too big.");
    return $text;
}

// The callback function is called once for each
// match found and is passed one parameter: $matches.
function _my_callback($matches)
{ // Either $1 or $2 matched, but never both.
    if ($matches[1]) {
        return $matches[1];
    }
    // Collapse multiple tabs and spaces into a single space.
    $matches[2] = preg_replace('/[ \t][ \t]++/S', ' ', $matches[2]);
    // Trim each line
    $matches[2] = preg_replace('/^ /m', '', $matches[2]);
    $matches[2] = preg_replace('/ $/m', '', $matches[2]);
    return $matches[2];
}

// Create some test data.
$data = "This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1'      => 'value1',
    'key2'      => 'value1',
    'key42'     => '<code>
        Pay no attention to this. It has been proven over and
        over again that it is <code>   unpossible   </code>
        to parse nested stuff with regex!           </code>'
));
</code>";

// Demonstrate that it works on one small test string.
echo("BEFORE:\n". $data ."\n\n");
echo("AFTER:\n". clean_non_code($data) ."\n\nTesting...");

// Build a large test string.
$bigdata = '';
for ($i = 0; $i < 30000; ++$i) $bigdata .= $data;
$size = strlen($bigdata);

// Measure how long it takes to process it.
$time = microtime(true);
$bigdata = clean_non_code($bigdata);
$time = microtime(true) - $time;

// Print benchmark results
printf("Done.\nTest string size: %d bytes. Time: %.3f sec. Speed: %.0f KB/s.\n",
    $size, $time, ($size / $time)/1024.);
?>

Here are the script's benchmark results when run on my test box: WinXP32 PHP 5.2.14 (cli)

'Test string size: 10410000 bytes. Time: 1.219 sec. Speed: 8337 KB/s.'

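Note that this solution does not handle CODE tags that have <> angle brackets in their attributes (probably a very rare edge case), but the regex could easily be modified to handle these as well. Note also that the maximum string length depends on the composition of the string content (i.e. big CODE blocks reduce the maximum input string length).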


Answer 3

What you need is to parse it with some form of HTML parser.

For example, you could use DOMDocument to walk over all elements, ignoring code elements, and strip the extra whitespace from the text nodes.
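A minimal sketch of the DOMDocument idea, under several assumptions: the docblock text is close enough to HTML for loadHTML() to parse it, the fragment is wrapped in a <div> with a made-up id so that it has a single root, and only runs of spaces are collapsed (lines are not trimmed). The function name is hypothetical. Be aware that serializing with saveHTML() may escape characters in text, for example turning > into &gt;, which may or may not be acceptable for a docblock.

```php
<?php
// Hypothetical helper: collapse runs of spaces in text nodes outside <code>.
function collapse_outside_code_dom(string $html): string
{
    $doc = new DOMDocument();

    // Silence parser warnings for non-standard markup, then restore the setting.
    $previous = libxml_use_internal_errors(true);
    // The wrapper div and its id are assumptions made for this sketch.
    $doc->loadHTML('<div id="root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Select every text node that has no <code> ancestor.
    foreach ($xpath->query('//text()[not(ancestor::code)]') as $textNode) {
        $textNode->data = preg_replace('/ {2,}/', ' ', $textNode->data);
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $root = $xpath->query('//div[@id="root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo collapse_outside_code_dom('a  b <code>x   y</code>  c'), "\n";
```

Because the parser builds a tree, nested code elements are skipped without any extra bookkeeping, which is the main attraction of this approach over a regex.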

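Alternatively, read the file with file() so that you have an array of lines (fopen() with fgets() also works if you prefer to read one line at a time), then step through the lines and strip whitespace from each line that is outside a code element.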

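To determine whether you are inside a code element, look for the opening tag <code> and set a flag that says you are in code-element mode. You can then skip those lines. Reset the flag when you encounter </code>. You can account for nesting by storing the state as an integer, even though nested code elements are not the wisest idea (why would you nest them?).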
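The line-by-line approach can be sketched roughly as follows. It assumes that <code> and </code> tags are written without attributes and that any line containing an opening tag should be left untouched, as in the question's sample; the function name is hypothetical. The integer $depth plays the role of the flag and also covers nesting.

```php
<?php
// Hypothetical helper: clean lines that are outside <code> elements.
function collapse_lines_outside_code(array $lines): array
{
    $depth = 0;   // current nesting depth of <code> elements
    $out = [];

    foreach ($lines as $line) {
        $lower  = strtolower($line);
        $opens  = substr_count($lower, '<code>');
        $closes = substr_count($lower, '</code>');

        // Only touch lines that are outside any code element and do not
        // themselves open one.
        if ($depth === 0 && $opens === 0) {
            $line = trim(preg_replace('/ {2,}/', ' ', $line), ' ');
        }

        // Update the depth after handling the line; never drop below zero.
        $depth = max(0, $depth + $opens - $closes);
        $out[] = $line;
    }
    return $out;
}

$text = "This  is some  text\n  This is also   some text\n\n"
      . "<code>\n    'key1' => 'value1',\n</code>";

echo implode("\n", collapse_lines_outside_code(explode("\n", $text))), "\n";
```

For a file on disk, the array could come from file($path, FILE_IGNORE_NEW_LINES) instead of explode().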

Mario suggested this before I did.

Answer 4

Parsing HTML with regular expressions is a bad idea.

RegEx match open tags except XHTML self-contained tags

Use something like Zend_DOM to parse the HTML and extract the parts in which whitespace should be replaced.

Conditions inside a regular expression
