I am using preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line);
and running it on phpliveregex.com.
It produces this array:
array(10
0=><b>test</b>
1=>or
2=><em>oh
3=>yeah</em>
4=>and
5=><i>
6=>oh
7=>yeah
8=></i>
9=>"ye we 'hold' it"
)
This is not what I want. It should split only on spaces that are outside HTML tags, like this:
array(6
0=><b>test</b>
1=>or
2=><em>oh yeah</em>
3=>and
4=><i>oh yeah</i>
5=>"ye we 'hold' it"
)
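With this regex I can only add an exception for "double quotes". I really need help adding more exceptions, for tags such as <img/><a></a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>
Any explanation of how the regex works is also appreciated.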
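For reference, the pattern above relies on the (*SKIP)(*F) idiom: every alternative except the last one matches something that must be protected, then (*SKIP) marks the current position and (*F) forces a failure, so the engine resumes after the protected text. Only the last alternative (the space) can actually succeed. A minimal sketch of extending that idea to the tags listed above follows. The function name splitOutsideTags is hypothetical, and the sketch assumes tags are not nested inside a tag of the same name and that attribute values contain no '>' character.

```php
<?php
// Hypothetical helper: split on spaces that are outside double quotes,
// outside <img> tags, and outside the listed paired tags.
function splitOutsideTags(string $input): array
{
    $pattern = '~
        "[^"]*"(*SKIP)(*F)                                      # protect a double-quoted part
      | <(a|pre|code|strong|b|em|i)\b[^>]*>.*?</\1>(*SKIP)(*F)  # protect a paired tag and its content
      | <img\b[^>]*>(*SKIP)(*F)                                 # protect a standalone img tag
      | \x20                                                    # otherwise split on a space
    ~xs';

    // PREG_SPLIT_NO_EMPTY drops the empty items produced by consecutive spaces.
    return preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitOutsideTags('<b>test</b> or <em>oh yeah</em> and <i>oh yeah</i> "ye we \'hold\' it"'));
```

The x flag only allows the pattern to be laid out with comments, and the s flag lets .*? cross line breaks inside a tag.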
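Answer 0: (score: 2)
Using DOMDocument is easier, because you don't need to describe what an HTML tag is or what it looks like. You only need to check the nodeType. When the node is a text node, split it with preg_match_all (which is more convenient here than designing a pattern for preg_split):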
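$html = 'spaces in a text node <b>test</b> or <em>oh yeah</em> and <i>oh yeah</i>
"ye we \'hold\' it"
"unclosed double quotes at the end';

$dom = new DOMDocument;
// Wrap the fragment in a <div> so it has a single root element;
// LIBXML_HTML_NOIMPLIED stops libxml from adding <html>/<body>.
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);
$nodeList = $dom->documentElement->childNodes;

$results = [];
foreach ($nodeList as $childNode) {
    if ($childNode->nodeType == XML_TEXT_NODE) {
        // Text node: keep quoted parts whole, split the rest on whitespace.
        // A whitespace-only text node yields no items.
        if (preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m))
            $results = array_merge($results, $m[0]);
    } else {
        // Any other node (an element): keep its HTML as a single item.
        $results[] = $dom->saveHTML($childNode);
    }
}
print_r($results);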
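Note: when a double-quoted part stays unclosed (no closing quote), I chose a default behaviour: the opening quote and everything after it in that text node is kept as one item. Feel free to change it.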
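As one possible way to change that behaviour, here is a small sketch that compares the text-node pattern used above with a stricter variant. Both function names are hypothetical. The variant is an assumption about what "changing it" could mean: a quoted part counts only when it is closed, and a stray opening quote becomes a one-character item of its own.

```php
<?php
// The pattern from the answer: an unclosed quote swallows the rest of the text.
function splitTextDefault(string $text): array
{
    preg_match_all('~[^\s"]+|"[^"]*"?~', $text, $m);
    return $m[0];
}

// Hypothetical stricter variant: only closed quotes are kept whole;
// a lone opening quote is returned as its own item.
function splitTextStrict(string $text): array
{
    preg_match_all('~[^\s"]+|"[^"]*"|"~', $text, $m);
    return $m[0];
}

print_r(splitTextDefault('a "b c" "d e')); // items: a, "b c", "d e
print_r(splitTextStrict('a "b c" "d e'));  // items: a, "b c", ", d, e
```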
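Note 2: sometimes the LIBXML_ constants are not defined. You can work around this by testing for the constant beforehand and defining it when needed:
if (!defined('LIBXML_HTML_NOIMPLIED'))
    define('LIBXML_HTML_NOIMPLIED', 8192);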
Answer 1: (score: 0)
Instead of using a split command, just match the parts you want:
<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*
Live demo
https://regex101.com/r/bK8iL3/1
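A minimal PHP sketch of running this expression with preg_match_all follows. The function name matchChunks, the ~ delimiter, the s flag, and the trimming and empty-string filtering are assumptions added for illustration; the pattern itself is copied unchanged. Because the last alternative can match an empty string, empty matches are filtered out. Note that, unlike a split, runs of plain text and quoted parts come back as whole chunks rather than as separate words.

```php
<?php
// Hypothetical wrapper around the expression from this answer.
function matchChunks(string $input): array
{
    // Nowdoc keeps every backslash and quote in the pattern literal.
    $pattern = <<<'REGEX'
~<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*~s
REGEX;

    preg_match_all($pattern, $input, $m);

    // Trim surrounding whitespace and drop the empty matches.
    $chunks = array_map('trim', $m[0]);
    return array_values(array_filter($chunks, static function (string $s): bool {
        return $s !== '';
    }));
}

print_r(matchChunks('<b>test</b> or <em>oh yeah</em> and "ye we \'hold\' it"'));
```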
Sample text
Note the difficult edge case in the second paragraph
<b>test</b> or <strong> this </strong><em> oh yeah </em> and <i>oh yeah</i> Here we are "ye we 'hold' it"
some<img/>gfsf<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>
Sample matches
MATCH 1
0. [0-11] `<b>test</b>`
MATCH 2
0. [11-15] ` or `
MATCH 3
0. [15-38] `<strong> this </strong>`
MATCH 4
0. [38-56] `<em> oh yeah </em>`
MATCH 5
0. [56-61] ` and `
MATCH 6
0. [61-75] `<i>oh yeah</i>`
MATCH 7
0. [75-111] ` Here we are "ye we 'hold' it" some`
MATCH 8
0. [111-117] `<img/>`
MATCH 9
0. [117-121] `gfsf`
MATCH 10
0. [121-213] `<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a>`
MATCH 11
0. [213-224] `<pre></pre>`
MATCH 12
0. [224-237] `<code></code>`
MATCH 13
0. [237-254] `<strong></strong>`
MATCH 14
0. [254-261] `<b></b>`
MATCH 15
0. [261-270] `<em></em>`
MATCH 16
0. [270-277] `<i></i>`
NODE EXPLANATION
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
img 'img'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[\s>\/] any character of: whitespace (\n, \r,
\t, \f, and " "), '>', '\/'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"\s>]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and "
"), '>' (0 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
a 'a'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
span 'span'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
pre 'pre'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
code 'code'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
strong 'strong'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
b 'b'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
em 'em'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
i 'i'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[\s>\\] any character of: whitespace (\n, \r,
\t, \f, and " "), '>', '\\'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"\s>]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and "
"), '>' (0 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^"<]* any character except: '"', '<' (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
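The \1 step in the explanation above is what ties a closing tag to its opening tag. Here is a reduced sketch of just that part, with a shortened tag list and no attribute handling; the function name firstPairedTag is hypothetical.

```php
<?php
// Hypothetical helper: return the first paired tag found in $input, or null.
function firstPairedTag(string $input): ?string
{
    // \1 repeats whatever tag name group 1 captured, so <em> must end with </em>.
    if (preg_match('~<(b|em|i)>.*?</\1>~s', $input, $m)) {
        return $m[0];
    }
    return null;
}

var_dump(firstPairedTag('x <em>oh <b>y</b> yeah</em> z')); // the whole <em>...</em> element
var_dump(firstPairedTag('<em>oops</i>'));                  // NULL: the closing tag does not match
```

Because .*? is lazy, the match stops at the first closing tag with the captured name, so a tag nested inside a tag of the same name (for example <b>a<b>b</b>c</b>) is cut short at the inner </b>.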