preg_replace unicode characters

Time: 2018-01-19 01:20:54

Tags: php unicode preg-replace preg-match

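I have several strings that contain Unicode escape sequences. My task is to remove everything from these strings except the Unicode escapes. For example, the following
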
\ud83d\ude82 + \u2600\ufe0f = \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29

would become

\ud83d\ude82 \u2600\ufe0f \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29

Then I need to find repeated codes and separate them, so that:

 \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29

becomes:

\ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29

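I have tried several preg_match solutions for the first part, but each one either removes nothing from the string or removes everything. Here is my latest attempt: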

/(^\\\u[0-9a-f]{4})+/

I'm not very familiar with regex and I'm getting confused, because I'm not sure what else to try.

The end goal is to be able to insert each Unicode character into the database as its own record.

1 answer:

Answer 0: (score: 0)

This can be done in two steps:

$str = '\ud83d\ude82 + \u2600\ufe0f = \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29';
// remove everything that follows a \uXXXX escape and is not a backslash
$str = preg_replace('/(?<=\\\\u[a-f0-9]{4})[^\\\\]+/', '', $str);
// insert a space between repeated pairs of escapes
$str = preg_replace('/((?:\\\\u[a-f0-9]{4}){2})(?=\1)/', '$1 ', $str);
echo $str,"\n";

Output:

\ud83d\ude82\u2600\ufe0f\ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29
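
Note that this output runs the first three groups together, whereas the question asks for a space between them. One possible way to get the layout from the question, and an array that could be inserted row by row, is to collect each uninterrupted run of escapes with preg_match_all and only then split the repeats. This is a minimal sketch, not part of the original answer: the function name splitEscapes is hypothetical, and it assumes lowercase hex digits and that only identical back-to-back pairs need splitting.

```php
<?php
// Hypothetical helper (not from the original answer).
// Returns one array element per escape group or repeated pair.
function splitEscapes(string $str): array
{
    // Collect each uninterrupted run of \uXXXX escapes; text between runs is dropped.
    preg_match_all('/(?:\\\\u[0-9a-f]{4})+/', $str, $matches);

    $records = [];
    foreach ($matches[0] as $run) {
        // Insert a space between back-to-back repeats of the same pair of escapes.
        $spaced = preg_replace('/((?:\\\\u[0-9a-f]{4}){2})(?=\1)/', '$1 ', $run);
        foreach (explode(' ', $spaced) as $record) {
            $records[] = $record;
        }
    }
    return $records;
}

$input = '\ud83d\ude82 + \u2600\ufe0f = \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29';
$records = splitEscapes($input);
echo implode(' ', $records), "\n";
```

With the sample string this yields seven records. Two different emoji written directly next to each other would still end up in a single record, because the second pattern only separates identical pairs.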

Regex #1:

/                       : regex delimiter
  (?<=                  : lookbehind
    \\\\u[a-f0-9]{4}    : unicode escape (\u + 4 hex digits)
  )                     : end lookbehind
  [^\\\\]+              : 1 or more any character that is NOT a backslash
/                       : regex delimiter
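
One consequence of anchoring on a preceding escape is that text before the first escape is left in place, because nothing precedes it for the lookbehind to match. A small illustration with made-up inputs (the strings below are not from the question):

```php
<?php
$pattern = '/(?<=\\\\u[a-f0-9]{4})[^\\\\]+/';

// Text that follows an escape is removed.
$after = preg_replace($pattern, '', '\u2600 def');

// Text before the first escape has no escape behind it, so it stays.
$before = preg_replace($pattern, '', 'abc \u2600 def');

echo $after, "\n";   // \u2600
echo $before, "\n";  // abc \u2600
```

If the real strings can start with ordinary text, the leading part would need separate handling.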

Regex #2:

/                       : regex delimiter
  (                     : start group 1
    (?:                 : non capture group
      \\\\u[a-f0-9]{4}  : a unicode character
    ){2}                : appears twice (2 unicode characters)
  )                     : end group 1
  (?=\1)                : lookahead, group 1 is repeated
/                       : regex delimiter
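
The lookahead matters because it only checks for the repeat without consuming it, so the repeat can become the start of the next match. The comparison below is a sketch with a shortened input; the second, consuming pattern is a variant written for illustration and is not from the answer.

```php
<?php
$run = '\ud83d\ude29\ud83d\ude29\ud83d\ude29';

// With the lookahead, the repeat is checked but not consumed,
// so every pair that is followed by a copy of itself gets a space.
$withLookahead = preg_replace('/((?:\\\\u[a-f0-9]{4}){2})(?=\1)/', '$1 ', $run);

// Variant for comparison: the repeat is consumed by the match,
// so the third pair is never paired up and no space is put in front of it.
$consuming = preg_replace('/((?:\\\\u[a-f0-9]{4}){2})\1/', '$1 $1', $run);

echo $withLookahead, "\n"; // \ud83d\ude29 \ud83d\ude29 \ud83d\ude29
echo $consuming, "\n";     // \ud83d\ude29 \ud83d\ude29\ud83d\ude29
```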