PHP preg_split在空格上,但不在标签内

时间:2016-06-19 10:22:39

标签: php html regex preg-split

我正在使用preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line);并在phpliveregex.com上运行 它产生数组:

array(10
  0=><b>test</b>
  1=>or
  2=><em>oh
  3=>yeah</em>
  4=>and
  5=><i>
  6=>oh
  7=>yeah
  8=></i>
  9=>"ye we 'hold' it"
)

不是我想要的,它应该只在html标签之外的空格分开,如下所示:

array(5
  0=><b>test</b>
  1=>or
  2=><em>oh yeah</em>
  3=>and
  4=><i>oh yeah</i>
  5=>"ye we 'hold' it"
)

在这个正则表达式中,我只能在“双引号”中添加例外,但真的需要帮助才能添加更多内容,例如标记<img/><a></a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>

关于正则表达式如何运作的任何解释也表示赞赏。

2 个答案:

答案 0 :(得分:2)

使用DOMDocument更容易,因为您不需要描述html标记是什么以及它的外观。您只需要检查nodeType。当它是textNode时,请将其与preg_match_all 分开(它比设计preg_split 的模式更方便:

$html = 'spaces in a text node <b>test</b> or <em>oh yeah</em> and <i>oh yeah</i>
"ye we \'hold\' it"
"unclosed double quotes at the end';

$dom = new DOMDocument;
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);

$nodeList = $dom->documentElement->childNodes;

$results = [];

foreach ($nodeList as $childNode) {
    if ($childNode->nodeType == XML_TEXT_NODE &&
        preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m))
        $results = array_merge($results, $m[0]);
    else
        $results[] = $dom->saveHTML($childNode);
}

print_r($results);

注意:当双引号部分保持未闭合(没有结束引用)时,我选择了默认行为,随时可以更改它。

注2:有时LIBXML_常量未定义。您可以在之前测试它并在需要时定义它来解决此问题:

if (!defined('LIBXML_HTML_NOIMPLIED'))
    define('LIBXML_HTML_NOIMPLIED', 8192);

答案 1 :(得分:0)

描述

而不是使用拆分命令只匹配您想要的部分

<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*

Regular expression visualization

实施例

现场演示

https://regex101.com/r/bK8iL3/1

示例文字

注意第二段中的困难边缘情况

<b>test</b> or <strong> this </strong><em> oh yeah </em> and <i>oh yeah</i> Here we are "ye we 'hold' it"

some<img/>gfsf<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>

样本匹配

MATCH 1
0.  [0-11]  `<b>test</b>`

MATCH 2
0.  [11-15] ` or `

MATCH 3
0.  [15-38] `<strong> this </strong>`

MATCH 4
0.  [38-56] `<em> oh yeah </em>`

MATCH 5
0.  [56-61] ` and `

MATCH 6
0.  [61-75] `<i>oh yeah</i>`

MATCH 7
0.  [75-111]    ` Here we are "ye we 'hold' it" some`

MATCH 8
0.  [111-117]   `<img/>`

MATCH 9
0.  [117-121]   `gfsf`

MATCH 10
0.  [121-213]   `<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a>`

MATCH 11
0.  [213-224]   `<pre></pre>`

MATCH 12
0.  [224-237]   `<code></code>`

MATCH 13
0.  [237-254]   `<strong></strong>`

MATCH 14
0.  [254-261]   `<b></b>`

MATCH 15
0.  [261-270]   `<em></em>`

MATCH 16
0.  [270-277]   `<i></i>`

解释

NODE                     EXPLANATION
----------------------------------------------------------------------
  <                        '<'
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      img                      'img'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    (?=                      look ahead to see if there is:
----------------------------------------------------------------------
      [\s>\/]                  any character of: whitespace (\n, \r,
                               \t, \f, and " "), '>', '\/'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
        [^']*                    any character except: ''' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
        [^"]*                    any character except: '"' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        [^'"\s>]*                any character except: ''', '"',
                                 whitespace (\n, \r, \t, \f, and "
                                 "), '>' (0 or more times (matching
                                 the most amount possible))
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    \s?                      whitespace (\n, \r, \t, \f, and " ")
                             (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    \/?                      '/' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      a                        'a'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      span                     'span'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      pre                      'pre'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      code                     'code'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      strong                   'strong'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      b                        'b'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      em                       'em'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      i                        'i'
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    (?=                      look ahead to see if there is:
----------------------------------------------------------------------
      [\s>\\]                  any character of: whitespace (\n, \r,
                               \t, \f, and " "), '>', '\\'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
        [^']*                    any character except: ''' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
        [^"]*                    any character except: '"' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        [^'"\s>]*                any character except: ''', '"',
                                 whitespace (\n, \r, \t, \f, and "
                                 "), '>' (0 or more times (matching
                                 the most amount possible))
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    \s?                      whitespace (\n, \r, \t, \f, and " ")
                             (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    \/?                      '/' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    \/                       '/'
----------------------------------------------------------------------
    \1                       what was matched by capture \1
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [^"<]*                   any character except: '"', '<' (0 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------