I'm writing a basic categorization tool that takes a title and compares it against a list of keywords. For example:
$cat['dining'] = array('food','restaurant','brunch','meal','cand(y|ies)');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
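Is there a creative way to loop through these categories, or to see which category has the most matches? Note that in the 'dining' array I use a regex to match variations of the word candy. I tried the following, but these category lists are getting long and I'm wondering whether this is the best approach: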
$keywordRegex = implode("|",$cat['dining']);
preg_match_all("/\b({$keywordRegex})\b/i", $string, $matches);
Thanks, Steve
Edit: Thanks to @jmathai, I was able to add ranking:
$matches = array();
foreach ($keywords as $k => $v) {
    str_replace($v, '#####', $masterString, $count);
    if ($count > 0) {
        $matches[$k] = $count;
    }
}
arsort($matches);
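Answer 0: (score: 4)
This can be done with a single loop.
For efficiency, I would split candy and candies into separate entries. A neat trick is to replace each match with some token. Here we use 10 #'s.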
$cat['dining'] = array('food','restaurant','brunch','meal','candy','candies');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$max = array(null, 0); // category, occurrences
foreach ($cat as $k => $v) {
    // Replace every keyword hit with the token, then count the tokens
    $replaced = str_replace($v, '##########', $string);
    preg_match_all('/##########/i', $replaced, $matches);
    if (count($matches[0]) > $max[1]) {
        $max[0] = $k;
        $max[1] = count($matches[0]);
    }
}
echo "Category {$max[0]} has the most ({$max[1]}) matches.\n";
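Note that str_replace also counts substrings, so 'food' is found inside 'seafood'. If whole-word matching is wanted, a minimal sketch along these lines might work. rankCategories is a hypothetical helper name, and the sketch assumes every keyword is a valid regex fragment that does not contain the '/' delimiter.

```php
<?php
// Hypothetical helper: rank categories by the number of whole-word keyword hits.
function rankCategories(array $cat, string $title): array
{
    $scores = array();
    foreach ($cat as $name => $keywords) {
        // Keywords are used as regex fragments, so 'cand(y|ies)' keeps working.
        // The non-capturing group makes \b apply to every alternative.
        $pattern = '/\b(?:' . implode('|', $keywords) . ')\b/i';
        // preg_match_all returns the number of full pattern matches.
        $scores[$name] = preg_match_all($pattern, $title);
    }
    arsort($scores); // highest count first, keys preserved
    return $scores;
}

$cat = array(
    'dining'   => array('food', 'restaurant', 'brunch', 'meal', 'cand(y|ies)'),
    'services' => array('service', 'cleaners', 'framing', 'printing'),
);
print_r(rankCategories($cat, 'Dinner at seafood restaurant'));
```

With whole-word matching, 'seafood' no longer counts as a hit for 'food', so the scores differ from the token-replacement loop.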
Answer 1: (score: 2)
$cat['dining'] = array('food','restaurant','brunch','meal');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$string = explode(' ',$string);
foreach ($cat as $key => $val) {
    $kwdMatches[$key] = count(array_intersect($string, $val));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);
Answer 2: (score: 1)
Provided the number of words isn't too large, one idea is to build a reverse lookup table once and then run each title against it.
// One-time reverse category creation
$reverseCat = array();
foreach ($cat as $cCategory => $cWordList) {
    foreach ($cWordList as $cWord) {
        if (!array_key_exists($cWord, $reverseCat)) {
            $reverseCat[$cWord] = array($cCategory);
        } else if (!in_array($cCategory, $reverseCat[$cWord])) {
            $reverseCat[$cWord][] = $cCategory;
        }
    }
}
// Processing a title
$stringWords = preg_split("/\b/", $string);
$matchingCategories = array();
foreach ($stringWords as $cWord) {
    if (array_key_exists($cWord, $reverseCat)) {
        $matchingCategories = array_merge($matchingCategories, $reverseCat[$cWord]);
    }
}
$matchingCategories = array_unique($matchingCategories);
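Answer 3: (score: 0)
You're doing an O(n * m) lookup, where n is the size of your categories and m is the size of the title. You could try organizing them like this: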
const DINING = 0;
const SERVICES = 1;
$categories = array(
    "food" => DINING,
    "restaurant" => DINING,
    "service" => SERVICES,
);
Then, for each word in the title, check $categories[$word] to find its category; that gets you O(m).
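As a rough sketch of that per-word lookup: countByCategory is a hypothetical helper name, and lowercasing the title and splitting on non-word characters are my assumptions about how the title should be tokenized. Plain keywords only; regex keywords such as cand(y|ies) would not work with a keyed lookup.

```php
<?php
const DINING = 0;
const SERVICES = 1;

$categories = array(
    'food'       => DINING,
    'restaurant' => DINING,
    'service'    => SERVICES,
);

// Hypothetical helper: one hash lookup per title word, tallied per category.
function countByCategory(array $categories, string $title): array
{
    $counts = array();
    // Assumption: words are runs of word characters, compared in lowercase.
    $words = preg_split('/\W+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (isset($categories[$word])) {
            $id = $categories[$word];
            $counts[$id] = ($counts[$id] ?? 0) + 1;
        }
    }
    arsort($counts); // best-matching category first
    return $counts;
}

print_r(countByCategory($categories, 'Food service at the restaurant'));
```

Categories with no hits are simply absent from the result, so an empty array means nothing matched.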
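Answer 4: (score: 0)
OK, here's my new answer, which lets you use regexes in the $cat[n] values. The one caveat with this code, which I can't figure out: for some reason it fails if a $cat[n] value starts with a metacharacter or character class.
For example, .*food won't work, but s.afood, sea.*, and so on will, and so will your example of cand(y|ies). I figure that's good enough for you, since I assume the point of the regex is to handle different forms of a word, and the beginning of a word rarely changes in that case.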
function rMatch($a, $b) {
    // Treat $b as a regex; a full match counts as "equal"
    if (preg_match('~^'.$b.'$~i', $a)) return 0;
    if ($a > $b) return 1;
    return -1;
}
$string = explode(' ', $string);
foreach ($cat as $key => $val) {
    $kwdMatches[$key] = count(array_uintersect($string, $val, 'rMatch'));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);
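The failure described above may come from array_uintersect expecting a callback that defines a consistent ordering, which a regex match does not; that is my reading, not something verified here. A sketch that sidesteps the issue by testing each word against each keyword directly follows. countRegexKeywordHits is a hypothetical helper name, and it assumes no keyword contains the '~' delimiter.

```php
<?php
// Hypothetical helper: count title words fully matched by at least one
// keyword, where each keyword is treated as a regex fragment.
function countRegexKeywordHits(array $keywords, array $words): int
{
    $hits = 0;
    foreach ($words as $word) {
        foreach ($keywords as $keyword) {
            // Anchor the fragment so it has to match the whole word.
            if (preg_match('~^(?:' . $keyword . ')$~i', $word)) {
                $hits++;
                break; // count each word at most once
            }
        }
    }
    return $hits;
}

$words = explode(' ', 'Dinner at seafood restaurant');
echo countRegexKeywordHits(array('.*food', 'restaurant', 'cand(y|ies)'), $words), "\n";
```

This costs one regex test per word and keyword pair, which is likely acceptable for short titles.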