RegEx for full-text search with misspellings

Asked: 2013-02-17 16:29:26

Tags: php regex full-text-search regex-group levenshtein-distance

I have a MySQL table with the following columns:

City      Country  Continent
New York  States   North America
New York  Germany  Europe - considering there's one ;)
Paris     France   Europe

If I want to find "New Yokr" despite the misspelling, that is easy with a MySQL stored function:

$querylev = "SELECT City, Country, Continent FROM `table`
            WHERE LEVENSHTEIN(`City`, 'New Yokr') < 3";
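
To make the `< 3` threshold concrete, here is a minimal sketch of the plain Levenshtein distance in JavaScript. It is not the stored function from the query above (whose implementation is not shown); the `levenshtein` helper below is written for illustration only and counts insertions, deletions and substitutions, so a swapped pair of letters costs 2.

```javascript
// Plain Levenshtein distance: insert, delete, substitute (no transposition step).
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 characters of a and the first j of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from the first i characters of a to the empty string
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // substitute, or keep on a match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein("New York", "New Yokr")); // 2, so it passes the "< 3" filter
console.log(levenshtein("Paris", "New Yokr"));    // well above the threshold
```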

But when there are two cities named New York, a full-text search lets you enter "New York States" and get the result you want.

So the question is: can I search for "New Yokr Statse" and get the same result?

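Is there any function that combines Levenshtein and full-text search into one solution, or should I create a new column in MySQL that concatenates the three columns?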
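
The concatenated-column idea can be sketched outside the database as follows. This is only an illustration under stated assumptions: the `rows` array mirrors the sample table, `fuzzyMatchRows` and `levenshtein` are hypothetical helpers, and the rule "every query word must be within 2 edits of some word in the row" is one possible way to combine the two techniques, not an existing MySQL feature.

```javascript
// Hypothetical in-memory copy of the sample table.
const rows = [
  { City: "New York", Country: "States", Continent: "North America" },
  { City: "New York", Country: "Germany", Continent: "Europe" },
  { City: "Paris", Country: "France", Continent: "Europe" },
];

// Plain Levenshtein distance (insert, delete, substitute).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// A row matches when every query word is within maxDistance edits
// of at least one word in the row's concatenated columns.
function fuzzyMatchRows(query, table, maxDistance = 2) {
  const queryWords = query.toLowerCase().split(/\s+/).filter(Boolean);
  return table.filter((row) => {
    const rowWords = `${row.City} ${row.Country} ${row.Continent}`
      .toLowerCase()
      .split(/\s+/);
    return queryWords.every((q) =>
      rowWords.some((w) => levenshtein(q, w) <= maxDistance)
    );
  });
}

const hits = fuzzyMatchRows("New Yokr Statse", rows);
console.log(hits); // only the New York / States row
```

A fixed threshold of 2 is generous for short words such as "new", so a real implementation would probably scale the allowed distance with word length.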

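I know there are other solutions such as Lucene or Sphinx (and also soundex and metaphone, but those do not work for this), but I think they might be too hard for me to implement.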

1 Answer:

Answer 0 (score: 0)

A good question, and a good example of how character lists and regex boundaries can be used to design a query and retrieve the desired data.

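Depending on the accuracy we want and the data we have in the database, we can design custom queries based on various expressions. For example, the following expression covers several variants of New York State: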

([new]+\s+[york]+\s+[stae]+)

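Here we have three character lists, which can be extended with other letters that are likely to show up in a misspelling. Each list matches one or more of its letters in any order, so [york]+ accepts both York and Yokr.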

[new]
[york]
[stae]
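
As a rough sketch of how such character lists could be derived instead of typed by hand, the helper below builds one list per word from that word's distinct letters. `buildCharListPattern` is a hypothetical name introduced here, and restricting the lists to ASCII letters and digits is an assumption made to keep the generated pattern safe.

```javascript
// Build a pattern with one character list per word of the phrase.
function buildCharListPattern(phrase) {
  const groups = phrase
    .trim()
    .split(/\s+/)
    .map((word) =>
      // Distinct letters in first-seen order; anything that is not a-z or 0-9 is dropped.
      [...new Set(word.toLowerCase())].filter((ch) => /[a-z0-9]/.test(ch)).join("")
    )
    .filter(Boolean) // skip words that contained no usable characters
    .map((letters) => `[${letters}]+`);
  // Lists are joined with \s+ and wrapped in a single capturing group.
  return new RegExp(`(${groups.join("\\s+")})`, "i");
}

const pattern = buildCharListPattern("New York State");
console.log(pattern.source);                  // ([new]+\s+[york]+\s+[stae]+)
console.log(pattern.test("New Yokr Statse")); // true
```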

We have also added two \s+ groups between the lists as boundaries, to improve accuracy.
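
To see how much the boundaries matter, the sketch below compares the expression as written with a variant that also adds \b word boundaries at both ends. The \b variant and the sample strings are additions for illustration and are not part of the original expression. Note that both versions still accept any rearrangement of the listed letters.

```javascript
const loose = /([new]+\s+[york]+\s+[stae]+)/i;
const bounded = /\b([new]+\s+[york]+\s+[stae]+)\b/i; // assumption: extra \b anchors

const samples = [
  "New Yokr Statse",  // intended misspelling
  "renew york state", // "enew" is the tail of another word
  "New York Stadium", // "Sta" is the head of another word
  "wen kory seat",    // same letters, unrelated words
];

for (const s of samples) {
  console.log(`${s} | loose: ${loose.test(s)} | bounded: ${bounded.test(s)}`);
}
// Both patterns accept "New Yokr Statse" and "wen kory seat";
// only the loose one accepts "renew york state" and "New York Stadium".
```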

DEMO

This snippet only shows how the capturing group works:

const regex = /([new]+\s+[york]+\s+[stae]+)/gmi;
const str = `Anything we wish to have before followed by a New York Statse then anything we wish to have after. Anything we wish to have before followed by a New  Yokr  State then anything we wish to have after. Anything we wish to have before followed by a New Yokr Stats then anything we wish to have after. Anything we wish to have before followed by a New York Statse then anything we wish to have after. `;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

PHP

$re = '/([new]+\s+[york]+\s+[stae]+)/mi';
$str = 'Anything we wish to have before followed by a New York Statse then anything we wish to have after. Anything we wish to have before followed by a New  Yokr  State then anything we wish to have after. Anything we wish to have before followed by a New Yokr Stats then anything we wish to have after. Anything we wish to have before followed by a New York Statse then anything we wish to have after. ';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);