Why does the following code behave differently for different multibyte strings?
echo preg_replace('@(?=\pL)@u', '*', 'م'); // prints: '*م' ✓
echo preg_replace('@(?=\pL)@u', '*', 'ض'); // prints: '*ض' ✓
echo preg_replace('@(?=\pL)@u', '*', 'غ'); // prints: '*�*�' ✗
echo preg_replace('@(?=\pL)@u', '*', 'ص'); // prints: '*�*�' ✗
Answer 0 (score: 2)
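You also need to include the modifier (Lm). See the following script, which iterates over the entire Arabic Unicode block (U+0600 to U+06FF):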
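<?php
// Encode a code point in the range U+0080..U+07FF as a two-byte UTF-8 sequence.
function uchar_2($dec)
{
    $utf = chr(192 + (($dec - ($dec % 64)) / 64)); // lead byte: 110xxxxx
    $utf .= chr(128 + ($dec % 64));                // continuation byte: 10xxxxxx
    return $utf;
}

$issues = 0;
$count = 0;
// 1536..1791 is U+0600..U+06FF, the Arabic block.
for ($dec = 1536; $dec <= 1791; $dec++) {
    $char = uchar_2($dec);
    // Report the character if the replacement changed it in any way.
    // Note: PCRE parses \pLm as \pL followed by a literal "m"; the braced
    // form \p{Lm} is what names the Lm category.
    if (preg_replace('@^(?=\pLm)$@u', '*', $char) !== $char) {
        printf("Issue with %s (%s)\n", $dec, $char);
        $issues++;
    }
    $count++;
}
printf("Found %d issues in %d rows\n", $issues, $count);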
Without Lm, about half of the characters fail.
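To see exactly where the asterisks land in the failing cases, it can help to look at raw bytes rather than rendered glyphs. The following is only a diagnostic sketch, not part of the original question or answer: `describe_bytes` is a hypothetical helper introduced here, and the sketch assumes PHP 7 or later (for the `"\u{...}"` escapes and scalar type declarations). It prints the hex bytes of each of the four characters from the question and of the `preg_replace` result, and says whether each string is still well-formed UTF-8.

```php
<?php
// Hypothetical helper (not from the original post): hex-dump a string and
// report whether it is well-formed UTF-8.
function describe_bytes(string $s): string
{
    // With the /u flag, preg_match() returns false for a subject that is not
    // valid UTF-8, so a result of 1 means the string is well-formed.
    $valid = preg_match('//u', $s) === 1 ? 'valid' : 'invalid';
    return sprintf('%s (%s UTF-8)', bin2hex($s), $valid);
}

// The four characters from the question: U+0645, U+0636, U+063A, U+0635.
foreach (["\u{0645}", "\u{0636}", "\u{063A}", "\u{0635}"] as $char) {
    // Cast to string because preg_replace() returns null on error.
    $result = (string) preg_replace('@(?=\pL)@u', '*', $char);
    printf("input %s -> result %s\n", describe_bytes($char), describe_bytes($result));
}
```

An output such as `*�*�` corresponds to an asterisk (`2a`) inserted between the two bytes of the character, for example `2ad82aba`, which the helper reports as invalid UTF-8. A result like `2ad8ba` means the asterisk was placed only before the whole character.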