<?php
$badWords = array("ban","bad","user","pass","stack","name","html");
$string = "Hello my name is user.";
$matches = array();
$matchFound = preg_match_all(
    // Separator first: the legacy (array, separator) order was removed in PHP 8.
    "/\b(" . implode("|", $badWords) . ")\b/i",
    $string,
    $matches
);
if ($matchFound) {
    $words = array_unique($matches[0]);
    echo "<ul>";
    foreach ($words as $word) {
        echo "<li>" . $word . "</li>";
    }
    echo "</ul>";
}
?>
But when I change $badWords to Hebrew:
$badWords = array("עזה","חמאס");
and change the text ($string) to Hebrew:
$string = "חמאס רוצה להרוג אותנו ולא יצליח";
it doesn't work.
Why?
It works fine with English!
Answer 0 (score: 1)
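You just need to tell the regex engine that the pattern and the subject contain UTF-8 characters, and that the character class \w and the word boundary \b must be extended to handle them. By default \w matches only ASCII letters, digits and the underscore, so \b never sees a boundary next to a Hebrew letter. There are two ways to do this:
Use the u modifier: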
$matchFound = preg_match_all(
    "/\b(" . implode("|", $badWords) . ")\b/iu",
    $string,
    $matches
);
Or put (*UTF8)(*UCP) at the very beginning of the pattern:
$matchFound = preg_match_all(
    "/(*UTF8)(*UCP)\b(" . implode("|", $badWords) . ")\b/i",
    $string,
    $matches
);
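To compare the two approaches with the unmodified pattern, here is a rough sketch that runs all three against the Hebrew example from the question. The helper buildPattern() is a hypothetical name of mine, and the preg_quote() call is an addition that is not in the original snippets. The counts assume PHP's default C locale.

```php
<?php
// Hypothetical helper (not from the original post): builds the alternation
// pattern from a word list, with an optional prefix and a set of modifiers.
function buildPattern(array $words, string $prefix, string $flags): string
{
    // preg_quote() is an added safeguard so that words containing regex
    // metacharacters are matched literally.
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);

    return '/' . $prefix . '\b(' . implode('|', $quoted) . ')\b/' . $flags;
}

$badWords = array("עזה", "חמאס");
$string   = "חמאס רוצה להרוג אותנו ולא יצליח";

$patterns = array(
    'no modifier'   => buildPattern($badWords, '', 'i'),
    'u modifier'    => buildPattern($badWords, '', 'iu'),
    '(*UTF8)(*UCP)' => buildPattern($badWords, '(*UTF8)(*UCP)', 'i'),
);

$counts = array();
foreach ($patterns as $label => $pattern) {
    // preg_match_all() returns the number of full-pattern matches.
    $counts[$label] = preg_match_all($pattern, $string, $matches);
    echo $label . ': ' . $counts[$label] . " match(es)\n";
}
```

Only the last two variants should report a match, because without Unicode handling the bytes of a Hebrew letter are not word characters and \b finds no boundary at the start of the string.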
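(*UTF8) tells the regex engine to treat the pattern and the subject string as UTF-8.
(*UCP) changes \w from the default [a-zA-Z0-9_] to [\p{L}\p{N}_]; \b, which is defined in terms of \w, changes with it.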
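The change in \w can be seen in isolation with a small sketch. The sample text is my own, and the results assume PHP's default C locale, where bytes above 0x7F are not word characters.

```php
<?php
// Sample text chosen for illustration: one Hebrew word, one English word.
$text = "שלום world";

// Byte mode: \w covers only ASCII letters, digits and the underscore.
preg_match_all('/\w+/', $text, $ascii);

// With the u modifier, \w also matches letters and digits of other scripts.
preg_match_all('/\w+/u', $text, $unicode);

// The explicit class given in the answer, for comparison.
preg_match_all('/[\p{L}\p{N}_]+/u', $text, $explicit);

echo implode(', ', $ascii[0]) . "\n";
echo implode(', ', $unicode[0]) . "\n";
echo implode(', ', $explicit[0]) . "\n";
```

In byte mode only the English word is found; with the u modifier both words are, and the explicit class gives the same result for this input.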