PHP: match a string against multiple keyword arrays

Asked: 2011-02-05 00:57:18

Tags: php regex arrays

I'm writing a basic categorization tool that takes a title and compares it against a set of keyword arrays. For example:

$cat['dining'] = array('food','restaurant','brunch','meal','cand(y|ies)');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';

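Is there a creative way to loop through these categories, or to see which category has the most matches? Note that in the 'dining' array I use a regular expression to match variations of the word candy. I tried the code below, but with these category lists getting long, I wonder whether it is the best approach: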

$keywordRegex = implode("|",$cat['dining']);
// Group the alternation so \b applies to every keyword, not only the first and last.
preg_match_all("/\b(?:{$keywordRegex})\b/i",$string,$matches);

Thanks, Steve

Edit: Thanks to @jmathai, I was able to add ranking:

    $matches = array(); 
    foreach($keywords as $k => $v) {
        str_replace($v, '#####', $masterString,$count);
        if($count > 0){
            $matches[$k] = $count;
        }
    }
    arsort($matches);

5 answers:

Answer 0 (score: 4)

This can be done with a single loop.

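For efficiency, I would split candy and candies into separate entries. A neat trick is to replace each match with a token; here we use ten # characters.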

$cat['dining'] = array('food','restaurant','brunch','meal','candy','candies');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';

$max = array(null, 0); // category, occurrences
foreach($cat as $k => $v) {
  $replaced = str_replace($v, '##########', $string);
  preg_match_all('/##########/i', $replaced, $matches);
  if(count($matches[0]) > $max[1]) {
    $max[0] = $k;
    $max[1] = count($matches[0]);
  }
}

echo "Category {$max[0]} has the most ({$max[1]}) matches.\n";

Answer 1 (score: 2)

$cat['dining'] = array('food','restaurant','brunch','meal');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';

$string = explode(' ',$string);
foreach ($cat as $key => $val) {
  $kwdMatches[$key] = count(array_intersect($string,$val));
}
arsort($kwdMatches);

echo "<pre>";
print_r($kwdMatches);

Answer 2 (score: 1)

Provided the number of words isn't too large, one idea is to build a reverse lookup table once and then run each title against it.

// One-time reverse category creation
$reverseCat = array();    
foreach ($cat as $cCategory => $cWordList) {
   foreach ($cWordList as $cWord) {
       if (!array_key_exists($cWord, $reverseCat)) {
           $reverseCat[$cWord] = array($cCategory);
       } else if (!in_array($cCategory, $reverseCat[$cWord])) {
           $reverseCat[$cWord][] = $cCategory;
       }
   }
}

// Processing a title
$stringWords = preg_split("/\b/", $string);

$matchingCategories = array();
foreach ($stringWords as $cWord) {
   if (array_key_exists($cWord, $reverseCat)) {
       $matchingCategories = array_merge($matchingCategories, $reverseCat[$cWord]);
   }
}

$matchingCategories = array_unique($matchingCategories);

Answer 3 (score: 0)

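You are doing an O(n * m) lookup, where n is the size of your categories and m is the size of the title. You could try organizing them like this instead: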

// PHP constants are declared and used without a leading $.
const DINING = 0;
const SERVICES = 1;

$categories = array(
    "food" => DINING,
    "restaurant" => DINING,
    "service" => SERVICES,
);

Then, for each word in the title, check $categories[$word] to find its category. That brings the lookup down to O(m).
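The per-word lookup described above could look roughly like the following. This is only a sketch: the names $keywordToCategory and tallyCategories are made up for illustration, the map uses plain string category names instead of constants, and it assumes that lowercasing the title and splitting on non-word characters is an acceptable way to get "words".

```php
<?php
// Hypothetical flat map: keyword => category name (illustrative data only).
$keywordToCategory = array(
    'food'       => 'dining',
    'restaurant' => 'dining',
    'brunch'     => 'dining',
    'service'    => 'services',
    'cleaners'   => 'services',
);

/**
 * Count, per category, how many words of $title appear as keys in $map.
 * Each word costs one hash lookup, so the work grows with the title length.
 */
function tallyCategories($title, array $map) {
    $tally = array();
    // Lowercase and split on non-word characters so "Restaurant," still matches.
    $words = preg_split('/\W+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (isset($map[$word])) {
            $category = $map[$word];
            $tally[$category] = isset($tally[$category]) ? $tally[$category] + 1 : 1;
        }
    }
    arsort($tally); // highest count first
    return $tally;
}

print_r(tallyCategories('Dinner at seafood restaurant', $keywordToCategory));
```

Because this is an exact whole-word lookup, 'seafood' does not count as a hit for 'food', and regex keywords such as cand(y|ies) would have to be expanded into separate entries first.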

Answer 4 (score: 0)

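Okay, here's my new answer, which lets you use regular expressions in the $cat[n] values. There is one caveat with this code that I can't figure out: for some reason it fails if a $cat[n] value starts with a metacharacter or a character class.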

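Example: .*food will not work. But s.afood or sea.* and so on, or your example cand(y|ies), will work. I figure this is good enough for you, since I assume the point of the regex is to handle different forms of a word, and the beginning of a word rarely changes in that case.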

function rMatch ($a,$b) {
  if (preg_match('~^'.$b.'$~i',$a)) return 0;
  if ($a>$b) return 1;
  return -1;
}

$string = explode(' ',$string);
foreach ($cat as $key => $val) {
  $kwdMatches[$key] = count(array_uintersect($string,$val,'rMatch'));
}
arsort($kwdMatches);

echo "<pre>";
print_r($kwdMatches);
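
One alternative that does not rely on a comparison callback at all is to build one pattern per category and let preg_match_all do the counting. The sketch below is an assumption-laden illustration rather than a drop-in fix: rankCategories is a hypothetical name, every keyword is treated as a regex fragment (so literal keywords must not contain unescaped metacharacters), and no keyword may contain the ~ delimiter.

```php
<?php
// Sample data from the question; 'cand(y|ies)' is a regex fragment.
$cat = array(
    'dining'   => array('food', 'restaurant', 'brunch', 'meal', 'cand(y|ies)'),
    'services' => array('service', 'cleaners', 'framing', 'printing'),
);

/**
 * Rank categories by the number of whole-word keyword matches in $title.
 * Categories without any match are left out of the result.
 */
function rankCategories($title, array $cat) {
    $scores = array();
    foreach ($cat as $name => $keywords) {
        // Wrap each fragment in its own group so a '|' inside one keyword
        // cannot change how the other keywords are parsed.
        $pattern = '~\b(?:(?:' . implode(')|(?:', $keywords) . '))\b~i';
        $count = preg_match_all($pattern, $title, $m);
        if ($count > 0) {
            $scores[$name] = $count;
        }
    }
    arsort($scores); // best-matching category first
    return $scores;
}

print_r(rankCategories('Candies and a meal at a seafood restaurant', $cat));
```

With this approach a keyword that starts with a metacharacter, such as .*food, is just another alternative in the pattern. The trade-off is one regex scan of the title per category.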