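I'm stuck on a PHP function that should optimize a search string for an MSSQL query.
I need to find an entry that looks like 'the hobbit' by searching for 'hobbit the'. I thought about cutting the articles (in German we have 'der', 'die' and 'das') out of the search string if they are followed by a trailing space.
My function looks like this: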
DisplayName
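But this doesn't work...
Maybe there is a better solution using a regular expression?
Answer 0 (score: 4)
1.) To remove one stopword from the start or the end of the string, use a regex like this: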
~^\W*(der|die|das|the)\W+\b|\b\W+(?1)\W*$~i
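~ is the pattern delimiter
^ the caret anchor matches the start of the string
\W (uppercase) is shorthand for a character that is not a word character
(der|die|das|the) is an alternation with |, captured as the first group
\b matches a word boundary
(?1) reuses the pattern of the first group
$ matches at the end of the string
The i (PCRE_CASELESS) flag is used. If the input is UTF-8, the u (PCRE_UTF8) flag is also needed.
Reference: What does this regex mean?
Generating the pattern: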
// array containing stopwords
$stopwords = array("der", "die", "das", "the");
// escape the stopword array and implode with pipe
$s = '~^\W*('.implode("|", array_map("preg_quote", $stopwords)).')\W+\b|\b\W+(?1)\W*$~i';
// replace with emptystring
$searchString = preg_replace($s, "", $searchString);
Note: if the ~ delimiter occurs in any entry of the $stopwords array, it must additionally be escaped with a backslash.
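The delimiter caveat above can be handled by passing the delimiter to preg_quote as its second argument. The following is a minimal sketch, not part of the original answer; the helper name removeEdgeStopwords is hypothetical, and the pattern is the same one shown above.

```php
<?php
// Hypothetical helper (not from the original answer): removes a stopword
// from the start and/or the end of a search string.
function removeEdgeStopwords(string $searchString, array $stopwords): string
{
    // Passing '~' as the second argument makes preg_quote escape the
    // pattern delimiter as well as the usual regex metacharacters.
    $quoted = array_map(function ($w) {
        return preg_quote($w, '~');
    }, $stopwords);

    $pattern = '~^\W*(' . implode('|', $quoted) . ')\W+\b|\b\W+(?1)\W*$~i';

    // preg_replace returns null on error; fall back to the input in that case.
    $result = preg_replace($pattern, '', $searchString);
    return $result === null ? $searchString : $result;
}

echo removeEdgeStopwords('hobbit the', array('der', 'die', 'das', 'the')), "\n";
```

With the stopword list from the answer, the call above prints hobbit. A string that consists only of a single stopword is returned unchanged, because the pattern requires a neighbouring word.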
PHP test at eval.in, regex pattern at regex101
2.) To remove stopwords anywhere in the string instead, how about splitting it into words:
// words to be removed
$stopwords = array(
'der' => 1,
'die' => 1,
'das' => 1,
'the' => 1);
# used words as key for better performance
// remove stopwords from string
function strip_stopwords($str = "")
{
global $stopwords;
// 1.) break string into words
// [^-\w\'] matches characters, that are not [0-9a-zA-Z_-']
// if input is unicode/utf-8, the u flag is needed: /pattern/u
$words = preg_split('/[^-\w\']+/', $str, -1, PREG_SPLIT_NO_EMPTY);
// 2.) if we have at least 2 words, remove stopwords
if(count($words) > 1)
{
$words = array_filter($words, function ($w) use (&$stopwords) {
return !isset($stopwords[strtolower($w)]);
# if utf-8: mb_strtolower($w, "utf-8")
});
}
// check if not too much was removed such as "the the" would return empty
if(!empty($words))
return implode(" ", $words);
return $str;
}
// test it
echo strip_stopwords("The Hobbit das foo, der");
Hobbit foo
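This solution also removes any punctuation except _, - and ', because after removing the common words it joins the remaining words with a single space. The idea is to prepare the string for a query.
Neither solution modifies the case, and both leave the string untouched if it consists of a single stopword.
List of common words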
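The code comments in the second solution mention the u flag and mb_strtolower for UTF-8 input. A sketch of what that variant might look like follows. It assumes the mbstring extension is available, uses the hypothetical name stripStopwordsUtf8, and swaps \w for the explicit Unicode classes \p{L} and \p{N} so that letters such as ü are treated as word characters.

```php
<?php
// Hypothetical UTF-8 variant of strip_stopwords (assumes the mbstring extension).
function stripStopwordsUtf8(string $str, array $stopwords): string
{
    // Use the (lowercase) stopwords as keys for fast isset() lookups.
    $lookup = array_fill_keys($stopwords, true);

    // Split on anything that is not a letter, digit, underscore, hyphen or
    // apostrophe. \p{L} and \p{N} are Unicode classes and need the u flag.
    $words = preg_split('/[^-\p{L}\p{N}_\']+/u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return $str; // preg_split failed, e.g. on invalid UTF-8 input
    }

    // Only filter when there are at least two words.
    if (count($words) > 1) {
        $words = array_filter($words, function ($w) use ($lookup) {
            return !isset($lookup[mb_strtolower($w, 'UTF-8')]);
        });
    }

    // Keep the original string if everything was removed.
    return empty($words) ? $str : implode(' ', $words);
}

echo stripStopwordsUtf8('Die Brücke, das Café', array('der', 'die', 'das', 'the')), "\n";
```

As in the original function, an input made up only of stopwords (for example "the the") is returned unchanged.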
Answer 1 (score: 2)
The solution provided by @Jonny 5 seems to be the best fit for my problem.
I now use a function like this:
public function optimizeSearchString($searchString = "")
{
$stopwords = array(
'der' => 1,
'die' => 1,
'das' => 1,
'the' => 1);
$words = preg_split('/[^-\w\']+/', $searchString, -1, PREG_SPLIT_NO_EMPTY);
if (count($words) > 1) {
$words = array_filter($words, function ($v) use (&$stopwords) {
return !isset($stopwords[strtolower($v)]);
}
);
}
if (empty($words)) {
return $searchString;
}
return implode(" ", $words);
}
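Jonny 5's new solution would also work, but I use this one because I'm not that familiar with regular expressions and here I know what is going on :-)
Answer 2 (score: 1)
This is what I did.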
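public function optimizeSearchString($searchString) {
    // $stopwords was undefined in the original snippet; it is defined locally here.
    $stopwords = array('der', 'die', 'das', 'the');
    // Format 1 makes str_word_count return the words as an array.
    $wordsFromSearchString = str_word_count($searchString, 1);
    // Note: array_diff compares case-sensitively, so "The" is not removed.
    $finalWords = array_diff($wordsFromSearchString, $stopwords);
    return implode(" ", $finalWords);
}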
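Answer 3 (score: 0)
I made another version using array_udiff, a variant of the array_diff that @Yashrajsinh Jadeja used. I pass "strcasecmp" as the third argument so that the comparison ignores case, and I use a simple word tokenizer to turn the input into an array.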
//Search string with article
$searchString = "Das blaue Haus"; //"The blue house"
//Split string into array. (This method is insufficient and doesn't account for compound nouns like "blue jay" or "einfamilienhaus".)
$wordArray = preg_split('/[^-\w\']+/', $searchString, -1, PREG_SPLIT_NO_EMPTY);
var_dump(optimizeSearchString($wordArray));
function optimizeSearchString($wordArray) {
$articles = array('der', 'die', 'das', 'the');
$newArray = array_udiff($wordArray, $articles, 'strcasecmp');
return $newArray;
}
Output:
array(2) {
[1]=>
string(5) "blaue"
[2]=>
string(4) "Haus"
}
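Note that the output above starts at key 1: array_udiff preserves the keys of the input array. If a re-indexed array or a plain string is needed for the query, something like the following sketch could be used. The function name optimizeSearchStringToString is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical wrapper (not from the original answer) that returns a string.
function optimizeSearchStringToString(string $searchString): string
{
    $articles = array('der', 'die', 'das', 'the');

    // Same simple tokenizer as in the answer above.
    $words = preg_split('/[^-\w\']+/', $searchString, -1, PREG_SPLIT_NO_EMPTY);

    // array_udiff keeps the original keys; array_values re-indexes from 0.
    // (implode does not need this step, it is shown for illustration.)
    $kept = array_values(array_udiff($words, $articles, 'strcasecmp'));

    return implode(' ', $kept);
}

echo optimizeSearchStringToString('Das blaue Haus'), "\n";
```

For the sample input this prints blaue Haus. Unlike the first answer, this sketch returns an empty string when the input consists only of articles.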
Answer 4 (score: 0)
public function optimizeSearchString($searchString)
{
$articles = array(
'der ',
'die ',
'das ',
'the '
);
foreach ($articles as $article) {
// only cut $article out of $searchString if it is longer than the $article itself
// strpos returns 0 for a match at the start, so compare against false explicitly
if (strlen($searchString) > strlen($article) && strpos($searchString, $article) !== false) {
$searchString = str_replace($article, '', $searchString);
break;
}
}
return $searchString;
}