Search: matching all words across multiple columns

Time: 2014-05-16 02:34:37

Tags: php mysql regex search

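I'm trying to add a lot of names (1700) to an existing database, but I need to check for duplicates. In fact, we assume that most of them are duplicates. Unfortunately, the names come from address labels and are not separated into fields (some are organization names, some are people's names). To lighten the load on the humans, I want to search for good matches on the names first. By a good match, I mean that all the words in a name (John Julie Smith) match across multiple db fields (title, firstname, lastname, suffix, spousename). So if firstname is John, lastname is Smith, and spousename is Julie, that is a match; or if firstname (in the db) is "John Julie" and lastname is "Smith", that matches too.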

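I've worked out a script that would do it all in PHP and run a separate query for each possibility, e.g. lastname = 'john julie smith', firstname = 'john julie smith' ... lastname = 'john julie' AND firstname = 'smith', and so on! That is 105 queries for a three-word name, and I have 1700 names to process. That sounds ridiculous.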

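I know PHP pretty well, but I don't know MySQL that well. Is there a query that can try to match all of the words across multiple columns? Even if it only handles one of the name combinations ("John, Julie, Smith" or "John Julie, Smith"). Maybe even using regular expressions?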


Here is what I have so far.

foreach( $a as $name ) {
    //There's some more stuff up here to prepare the strings,
    //removing &/and, punctuation, making everything lower case...

    $na = explode( " ", $name );

    $divisions = count( $na ) - 1;
    $poss = array();
    for( $i = 0; $i < pow(2, $divisions); $i++ ) {
        $div = str_pad(decbin($i), $divisions, '0', STR_PAD_LEFT);
        $tpa = array();
        $tps = '';
        foreach($na as $nak => $nav) {
            if ( $nak > 0 && substr( $div, $nak - 1, 1 ) ) {
                $tpa[] = $tps;
                $tps = $nav;
            } else {
                $tps = trim( $tps . ' ' . $nav );
            }
        }
        $tpa[] = $tps;
        $poss[] = $tpa;
    }
    foreach( $poss as $possk => $possv ) {
        $count = count( $possv );
        //Here's where I am... 
        //I could use $count and some math to come up with all the possible searches here,
        //But my head is starting to spin as I try to think of how to do that.
    }

    die();
}

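So far, the PHP creates an array ($poss) containing every possible way of grouping the words in the name string, with the word order preserved. For "John Julie Smith", the array looks like this: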

Array
(
    [0] => Array
        (
            [0] => john julie smith
        )

    [1] => Array
        (
            [0] => john julie
            [1] => smith
        )

    [2] => Array
        (
            [0] => john
            [1] => julie smith
        )

    [3] => Array
        (
            [0] => john
            [1] => julie
            [2] => smith
        )

)
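For reference, the same grouping step can be written as a standalone function. This is only a sketch: contiguousSplits() is a hypothetical name that does not appear in the original script, and it assumes the name has already been cleaned and split into words.

```php
<?php
// Hypothetical helper (not from the original post): returns every way to cut
// a list of words into contiguous groups, in the same order as the $poss dump.
function contiguousSplits(array $words): array
{
    if ($words === []) {
        return [];
    }
    $gaps = count($words) - 1;              // number of places a cut can go
    $splits = [];
    for ($mask = 0; $mask < (1 << $gaps); $mask++) {
        $groups = [];
        $current = $words[0];
        for ($i = 1; $i <= $gaps; $i++) {
            // Read the bits most significant first, mirroring decbin()/str_pad().
            $cut = ($mask >> ($gaps - $i)) & 1;
            if ($cut) {
                $groups[] = $current;       // close the current group
                $current = $words[$i];      // and start a new one
            } else {
                $current .= ' ' . $words[$i];
            }
        }
        $groups[] = $current;
        $splits[] = $groups;
    }
    return $splits;
}

$poss = contiguousSplits(explode(' ', 'john julie smith'));
echo count($poss), "\n"; // 4 splits for a three-word name
```

A name with n words has n - 1 gaps, so the function returns 2^(n - 1) splits.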

The original idea was to loop through the array and create a ton of queries. For [0], there would be 5 queries:

... WHERE firstname = 'john julie smith';
... WHERE lastname = 'john julie smith';
... WHERE spousename = 'john julie smith';
... WHERE title = 'john julie smith';
... WHERE suffix = 'john julie smith';

But [1] would have 20 queries:

... WHERE firstname = 'john julie' AND lastname = 'smith';
... WHERE firstname = 'john julie' AND spousename = 'smith';
... WHERE firstname = 'john julie' AND title = 'smith';
... WHERE firstname = 'john julie' AND suffix = 'smith';
... WHERE lastname = 'john julie' AND firstname = 'smith';
... WHERE lastname = 'john julie' AND spousename = 'smith';
... WHERE lastname = 'john julie' AND title = 'smith';
... WHERE lastname = 'john julie' AND suffix = 'smith';
//and on and on

For [3], there would be 60 queries! At this rate I'm looking at 170,000+ queries!
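The arithmetic behind those numbers can be checked with a short sketch. The function names are hypothetical, and the final estimate assumes every one of the 1700 names has exactly three words, which the real data will not. A split into k groups can be assigned to 5 columns in 5 * 4 * ... * (5 - k + 1) ways.

```php
<?php
// Ways to assign $groups word groups to distinct columns out of $columns
// (an ordered choice without repetition).
function assignmentCount(int $columns, int $groups): int
{
    $count = 1;
    for ($i = 0; $i < $groups; $i++) {
        $count *= ($columns - $i);
    }
    return $count;
}

// Total equality queries for a name of $words words. Each of the words - 1
// gaps is either cut or not, so there are C(words - 1, groups - 1) splits
// into $groups groups.
function queriesForName(int $words, int $columns = 5): int
{
    $total = 0;
    $gaps = $words - 1;
    $splits = 1; // C(gaps, 0)
    for ($groups = 1; $groups <= min($words, $columns); $groups++) {
        $total += $splits * assignmentCount($columns, $groups);
        // Step from C(gaps, groups - 1) to C(gaps, groups).
        $splits = intdiv($splits * ($gaps - ($groups - 1)), $groups);
    }
    return $total;
}

echo queriesForName(3), "\n";        // 5 + 20 + 20 + 60 = 105
echo 1700 * queriesForName(3), "\n"; // 178500
```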

There has to be a better way...

1 Answer:

Answer 0 (score: 1)

Load the 1700 names into a table in MySQL.
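One way to do the loading from PHP is sketched below. The table names with a single name column matches what the query in this answer expects. The cleanup rules, the function names normalizeName() and buildInsert(), and the $pdo connection are all assumptions. The ASCII-only pattern would also strip accented letters, so adjust it if the labels contain any.

```php
<?php
// Normalize a label roughly as the question describes: lower case, drop
// "&"/"and" and punctuation, collapse whitespace. The exact rules are assumed.
function normalizeName(string $raw): string
{
    $name = strtolower($raw);
    $name = preg_replace('/[^a-z0-9\s]+/', ' ', $name); // punctuation and "&"
    $name = preg_replace('/\band\b/', ' ', $name);      // the word "and"
    return trim(preg_replace('/\s+/', ' ', $name));
}

// Build one multi-row INSERT with placeholders; returns [sql, params].
// Expects at least one name.
function buildInsert(array $rawNames): array
{
    $params = array_values(array_unique(array_map('normalizeName', $rawNames)));
    $rows = implode(', ', array_fill(0, count($params), '(?)'));
    return ["INSERT INTO names (name) VALUES $rows", $params];
}

[$sql, $params] = buildInsert(['John & Julie Smith', 'Smith, John and Julie']);
echo $sql, "\n";
// With an existing PDO connection (assumed to be in $pdo):
// $pdo->prepare($sql)->execute($params);
```

Using placeholders keeps names such as O'Brien from breaking the statement.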

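Then, I think the following approach would help. Look for matches in the fields and order by the rows with the most matches. It is not 100% perfect, but I suspect it will help a bit. The query is: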

select n.name, t.*,
       -- each LIKE is parenthesized: in MySQL, + binds more tightly than LIKE
       ((n.name like concat('%', firstname, '%')) +
        (n.name like concat('%', lastname, '%')) +
        (n.name like concat('%', suffix, '%')) +
        (n.name like concat('%', spousename, '%'))
       ) as NumMatches
from table t join   -- "table" stands for the existing table's name
     names n
     on n.name like concat('%', firstname, '%') or
        n.name like concat('%', lastname, '%') or
        n.name like concat('%', suffix, '%') or
        n.name like concat('%', spousename, '%')
group by t.firstname, t.lastname, t.suffix, t.spousename, n.name
order by NumMatches desc;   -- rows with the most matches first

EDIT:

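I left this out the first time, but you can count the words in each name and compare that with the number of matches. The word count is the number of spaces plus one. Put this clause before the order by:

having NumMatches = length(n.name) - length(replace(n.name, ' ', '')) + 1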

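This still isn't perfect, because the same name could be in multiple fields. In practice, it should work quite well. If you want to be more pedantic, you could do something like: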

having concat_ws(':', firstname, lastname, suffix, spousename) like concat('%', substring_index(n.name, ' ', 1), '%') and
       concat_ws(':', firstname, lastname, suffix, spousename) like concat('%', substring_index(substring_index(n.name, ' ', 2), ' ', -1), '%') and
       concat_ws(':', firstname, lastname, suffix, spousename) like concat('%', substring_index(substring_index(n.name, ' ', 3), ' ', -1), '%') and
       concat_ws(':', firstname, lastname, suffix, spousename) like concat('%', substring_index(substring_index(n.name, ' ', 4), ' ', -1), '%')

This tests each word of the name independently.
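To spot-check candidate rows on the PHP side, the same test can be sketched as a function. allWordsFound() is a hypothetical name. It assumes rows are fetched as associative arrays and that the column collation is case-insensitive, which strtolower() imitates. Unlike the SQL above, it handles any number of words, not only the first four.

```php
<?php
// Mirrors the HAVING clause: every word of $name must occur somewhere in the
// row's fields joined with ':' (a substring match, like LIKE '%word%').
function allWordsFound(string $name, array $row): bool
{
    $fields = ['firstname', 'lastname', 'suffix', 'spousename'];
    $parts = [];
    foreach ($fields as $field) {
        if (isset($row[$field])) {      // concat_ws() also skips NULL values
            $parts[] = $row[$field];
        }
    }
    $haystack = strtolower(implode(':', $parts));
    foreach (preg_split('/\s+/', strtolower(trim($name))) as $word) {
        if ($word !== '' && strpos($haystack, $word) === false) {
            return false;
        }
    }
    return true;
}

$row = ['firstname' => 'John', 'lastname' => 'Smith', 'suffix' => null, 'spousename' => 'Julie'];
var_dump(allWordsFound('john julie smith', $row)); // bool(true)
var_dump(allWordsFound('john mary smith', $row));  // bool(false)
```

Because this is a substring match, "john" also matches "Johnson", just as it does with LIKE, so the results are candidates for review, not confirmed duplicates.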