Question

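Since I can't use preg_match (UTF-8 support is somehow broken: it works locally but breaks in production), I want to find another way to match words against a blacklist. The problem is that I want to match whole words only, not just find the first occurrence of the substring.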

This is how I did it with preg_match:

preg_match('/\b(badword)\b/', strtolower($string));

Example string:

$string = "This is a string containing badwords and one badword";

I want to match only "badword" (at the end), not "badwords".

strpos($string, 'badword') matches the first one (the "badword" inside "badwords").

Any ideas?

Answer 1

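Assuming you can do some preprocessing, you could replace all punctuation with whitespace and lowercase everything, and then either:

Use strpos in a while loop, with something like strpos($string, ' badword '), to keep walking through the whole document; or
Split the string on whitespace and compare each word against your list of bad words.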

So, if you went with the first option, it would look something like this (untested):

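// $string is the body of text to process. Pad it with spaces so that
// words at the very start or end also have a space on both sides.
$document = ' ' . strtolower($string) . ' ';
// Replace punctuation with spaces (extend the list as needed).
$document = str_replace(str_split('!@#$%^&*(),./'), ' ', $document);
$badWords = array('badword'); // your blacklist
$matches = array();
foreach ($badWords as $word) {
    $needle = ' ' . $word . ' ';
    $offset = 0;
    while (($pos = strpos($document, $needle, $offset)) !== false) {
        // $pos is relative to the padded $document
        $matches[] = array($word, $pos);
        // Step over the word but not the trailing space, so that space
        // can also serve as the leading space of the next match.
        $offset = $pos + strlen($word) + 1;
    }
}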

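Edit: Following @jonhopkins's suggestion, appending a space at the end should cover the case where the word is at the very end of the document with no punctuation after it.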
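The second option (splitting on whitespace and comparing each word against the list) could be sketched as follows. This is a minimal sketch, not part of the original answer: the function name findBadWords, the separator list, and the sample blacklist are assumptions made for illustration, and strtolower() is only reliable for ASCII letters.

```php
<?php
// Hypothetical helper: returns the blacklisted words that occur as whole words.
function findBadWords(string $text, array $badWords): array
{
    // Assumed separator list; extend it as needed.
    $separators = array_merge(str_split('!@#$%^&*(),./?;:"'), array("\t", "\n", "\r"));
    // Lowercase the text and turn every separator into a plain space.
    $clean = str_replace($separators, ' ', strtolower($text));
    // Split on spaces and drop the empty pieces left by repeated spaces.
    $words = array_filter(explode(' ', $clean), 'strlen');
    // Keep only the words that appear in the blacklist, without duplicates.
    return array_values(array_unique(array_intersect($words, $badWords)));
}

$string = "This is a string containing badwords and one badword";
var_dump(findBadWords($string, array('badword')));
```

Because the comparison is made on whole tokens, "badwords" is not reported for the blacklist entry "badword".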

Answer 2

If you want to mimic the regex word-boundary assertion \b, you can try something like this:

$offset = 0;
$word = 'badword';
$matched = array();
// Characters treated as word boundaries; extend the list as needed.
$boundaryChars = array(' ', '-');
$length = strlen($string);
while (($pos = strpos($string, $word, $offset)) !== false) {
    // Index of the first char after the match
    $end = $pos + strlen($word);

    // Left boundary: the match starts the string, or the previous char is a boundary
    $leftBoundary = ($pos === 0) || in_array($string[$pos - 1], $boundaryChars, true);

    // Right boundary: the match ends the string, or the next char is a boundary
    $rightBoundary = ($end === $length) || in_array($string[$end], $boundaryChars, true);

    // If it has both boundaries, we add the index to the matched ones...
    if ($leftBoundary && $rightBoundary) {
        $matched[] = $pos;
    }

    $offset = $end;
}
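For reuse, the same boundary check could be wrapped in a function and run on the example string from the question. This is only a sketch: the names isWordChar and findWholeWord are hypothetical, and using ctype_alnum() plus the underscore to decide what counts as a word character is an assumption that swaps the explicit boundary list for something closer to \b. It works on bytes, so it is only dependable for ASCII text.

```php
<?php
// Hypothetical helper: true if $char is an ASCII letter, digit, or underscore.
function isWordChar(string $char): bool
{
    return $char === '_' || ctype_alnum($char);
}

// Hypothetical helper: returns the byte offsets of every whole-word match.
function findWholeWord(string $haystack, string $word): array
{
    $matched = array();
    if ($word === '') {
        return $matched; // nothing sensible to search for
    }
    $offset = 0;
    $length = strlen($haystack);
    while (($pos = strpos($haystack, $word, $offset)) !== false) {
        $end = $pos + strlen($word); // index of the first char after the match
        $leftOk = $pos === 0 || !isWordChar($haystack[$pos - 1]);
        $rightOk = $end === $length || !isWordChar($haystack[$end]);
        if ($leftOk && $rightOk) {
            $matched[] = $pos;
        }
        $offset = $end;
    }
    return $matched;
}

$string = "This is a string containing badwords and one badword";
var_dump(findWholeWord(strtolower($string), 'badword'));
```

On the example string only offset 45 is reported: the occurrence at offset 28 is followed by "s", so it fails the right-boundary check.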

Answer 3

You can use strrpos() instead of strpos():

strrpos - Find the position of the last occurrence of a substring in a string

$string = "This is a string containing badwords and one badword";
var_dump(strrpos($string, 'badword'));

Output:

int(45)

Answer 4

A simple approach that builds word boundaries from Unicode properties:

preg_match('/(?:^|[^\pL\pN_])(badword)(?:[^\pL\pN_]|$)/u', $string);

In reality it is much more complicated than that; see here.
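As a quick check, the pattern can be run against the example string from the question. This is a sketch under assumptions: the lowercase step, the offset capture, and the error branch are illustrative additions rather than part of the answer. preg_match() returns false when the pattern or subject cannot be processed, and preg_last_error() may help narrow down why.

```php
<?php
$string = "This is a string containing badwords and one badword";
$pattern = '/(?:^|[^\pL\pN_])(badword)(?:[^\pL\pN_]|$)/u';

// PREG_OFFSET_CAPTURE makes each match an array of [text, byte offset].
$result = preg_match($pattern, strtolower($string), $m, PREG_OFFSET_CAPTURE);

if ($result === false) {
    // Something went wrong (for example invalid UTF-8 in the subject).
    echo 'PCRE error code: ' . preg_last_error() . "\n";
} elseif ($result === 1) {
    // Group 1 holds the word itself, without the surrounding boundary characters.
    echo 'Matched "' . $m[1][0] . '" at byte offset ' . $m[1][1] . "\n";
} else {
    echo "No whole-word match\n";
}
```

Reading the match from group 1 matters here, because the full match also includes the neighbouring non-word character on each side.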
