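My regex is meant to split on any Unicode whitespace other than newlines, while keeping each newline attached to the preceding non-whitespace character. Right now this seems to work, but only for a \n that has a single whitespace character before it.
This is my current regex: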
$data = "the\nquick\n brown fox jumped \nover the lazy dog.";
$tokenized = preg_split("~(?<=\n)|\p{Z}+(?!\n)~u", $data, -1, PREG_SPLIT_OFFSET_CAPTURE);
Current result (I have written \n wherever a "\n" character is present):
Array
(
[0] => Array
(
[0] => the\n
[1] => 0
)
[1] => Array
(
[0] => quick\n
[1] => 4
)
[2] => Array
(
[0] =>
[1] => 10
)
[3] => Array
(
[0] => brown
[1] => 11
)
[4] => Array
(
[0] => fox
[1] => 17
)
[5] => Array
(
[0] => jumped
[1] => 21
)
[6] => Array
(
[0] => \n
[1] => 31
)
[7] => Array
(
[0] => over
[1] => 33
)
[8] => Array
(
[0] => the
[1] => 38
)
[9] => Array
(
[0] => lazy
[1] => 42
)
[10] => Array
(
[0] => dog.
[1] => 47
)
)
Expected result:
Array
(
[0] => Array
(
[0] => the\n
[1] => 0
)
[1] => Array
(
[0] => quick\n
[1] => 4
)
[2] => Array
(
[0] => brown
[1] => 10
)
[3] => Array
(
[0] => fox
[1] => 16
)
[4] => Array
(
[0] => jumped\n
[1] => 20
)
[5] => Array
(
[0] => over
[1] => 27
)
[6] => Array
(
[0] => the
[1] => 32
)
[7] => Array
(
[0] => lazy
[1] => 36
)
[8] => Array
(
[0] => dog.
[1] => 41
)
)
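One possible way to get tokens shaped like the expected result is sketched below. It is an assumption on my part, not a confirmed solution: it uses two passes instead of a single regex, and tokenize() is a hypothetical helper name. The first pass removes separators that sit directly before a newline, and the second pass splits right after each newline or on any remaining run of separators.

```php
<?php
// Sketch only: tokenize() is a hypothetical helper, not an existing function.
function tokenize(string $data): array
{
    // Pass 1: drop Unicode separators directly before a newline, so the
    // newline ends up adjacent to the preceding word.
    $normalized = preg_replace('~\p{Z}+(?=\n)~u', '', $data);

    // Pass 2: split right after each newline (also consuming separators that
    // follow it), or on any other run of separators.
    return preg_split(
        '~(?<=\n)\p{Z}*|\p{Z}+~u',
        $normalized,
        -1,
        PREG_SPLIT_OFFSET_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

$data = "the\nquick\n brown fox jumped \nover the lazy dog.";

// Keep only the token strings; index 1 of each entry holds the offset.
$tokens = array_column(tokenize($data), 0);
```

Note that the offsets returned by this sketch refer to the normalized string from the first pass, not to the original input, so they may not line up with offsets computed against $data.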
Any suggestions are greatly appreciated. Thanks.