Hebrew strings do not work in preg_match_all

Time: 2014-08-23 17:04:26

Tags: php preg-match-all hebrew

<?php

$badWords = array("ban","bad","user","pass","stack","name","html");

$string = "Hello my name is user.";

$matches = array();
// implode() takes the separator first; the reversed order is rejected by PHP 8.
$matchFound = preg_match_all(
                "/\b(" . implode("|", $badWords) . ")\b/i", 
                $string, 
                $matches
              );

if ($matchFound) {
  $words = array_unique($matches[0]);
  echo "<ul>"; // open the list that is closed below
  foreach($words as $word) {
    echo "<li>" . $word . "</li>";
  }
  echo "</ul>";
}
?>

But when I change $badWords to Hebrew:

$badWords = array("עזה","חמאס");

and change the text ($string) to Hebrew:

$string = "חמאס רוצה להרוג אותנו ולא יצליח";

it does not work.

Why?

It works fine with English!

1 Answer:

Answer 0 (score: 1):

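You only need to tell the regex engine that the pattern contains UTF-8 characters, and to change the meaning of the character class \w and the word boundary \b so that they handle UTF-8 characters (by default \w covers only ASCII letters, digits, and the underscore, so a Hebrew letter never counts as a word character and \b never sees a boundary next to it). There are two ways to do that:

Use the u modifier: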

$matchFound = preg_match_all(
            "/\b(" . implode("|", $badWords) . ")\b/iu", 
            $string, 
            $matches
          );

Or put (*UTF8)(*UCP) at the very start of the pattern:

$matchFound = preg_match_all(
            "/(*UTF8)(*UCP)\b(" . implode("|", $badWords) . ")\b/i", 
            $string, 
            $matches
          );

(*UTF8) tells the regex engine that the pattern string must be treated as a UTF-8 string.
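A minimal sketch of what UTF-8 mode changes, assuming the source file is saved as UTF-8. It uses the u modifier rather than (*UTF8), and the variable names are only illustrative. Without UTF-8 mode the engine works byte by byte, so a single `.` cannot cover a Hebrew letter that occupies two bytes:

```php
<?php
// Assumption: this file is saved as UTF-8, so "ח" is the two bytes 0xD7 0x97.
$letter = "ח";

$byteLength   = strlen($letter);               // strlen() counts bytes, not characters
$matchesBytes = preg_match('/^.$/', $letter);   // byte mode: "." consumes one byte only
$matchesUtf8  = preg_match('/^.$/u', $letter);  // UTF-8 mode: "." consumes one character

var_dump($byteLength, $matchesBytes, $matchesUtf8);
```

On a UTF-8 source file this should report a byte length of 2, no match in byte mode, and one match in UTF-8 mode.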

(*UCP) changes \w from the default [a-zA-Z0-9_] to [\p{L}\p{N}_].
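Putting the pieces together, here is a hedged sketch that runs the question's Hebrew data with and without the u modifier. The helper findBadWords() and its $unicode flag are hypothetical names introduced for illustration, and the preg_quote() call is an addition the original answer does not include. It assumes PHP 7 or later and a source file saved as UTF-8:

```php
<?php
// findBadWords() is a hypothetical helper, not part of the original answer.
function findBadWords(array $badWords, string $text, bool $unicode): array
{
    // preg_quote() escapes regex metacharacters that a word might contain.
    $alternatives = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $badWords);

    // Append "u" only when Unicode handling is requested.
    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/i' . ($unicode ? 'u' : '');

    preg_match_all($pattern, $text, $matches);

    // Distinct matched words, re-indexed from 0.
    return array_values(array_unique($matches[0]));
}

$badWords = array("עזה", "חמאס");
$string   = "חמאס רוצה להרוג אותנו ולא יצליח";

$withoutU = findBadWords($badWords, $string, false); // \b only knows ASCII word characters
$withU    = findBadWords($badWords, $string, true);  // \b follows Unicode letters

// The same difference, seen directly on \w:
$asciiWord   = preg_match('/^\w+$/', "חמאס");   // Hebrew bytes are not ASCII word characters
$unicodeWord = preg_match('/^\w+$/u', "חמאס");  // Hebrew letters count as word characters

var_dump($withoutU, $withU, $asciiWord, $unicodeWord);
```

Without the modifier the list of matches comes back empty, which is the behaviour described in the question; with it, the word at the start of the string is found.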