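I have a large string str and a needle ndl. I need to find the text in str that is most similar to ndl. For example:
SOURCE: "This is a demo text and I love you about this"
NEEDLE: "I you love"
OUTPUT: "I love you"
SOURCE: "I have a unique idea. Do you need one?"
NEEDLE: "a unik idia"
OUTPUT: "a unique idea"
I found that I could do this with a similarity measure such as cosine or Manhattan similarity. However, I think implementing such an algorithm would be difficult. Could you suggest a simple or fast way to do this, perhaps with a PHP library function? TIA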
Answer 0 (score: 1)
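There is no native PHP function that does this, but what PHP can do is limited only by your imagination. We can't recommend libraries here, so keep in mind that this kind of question can be flagged as off-topic. Instead of suggesting libraries, I will just point you in the direction you need to explore.
By its nature, your question implies that simple string-matching functions such as stripos and co. are not enough, and regular expressions cannot do this either. For example, those functions cannot match unik with unique, or idia with idea. So you need to look at something like the levenshtein function. But since you want a substring rather than the whole string, and to keep things light on your server, you will need a little imagination. For example, you can break both the haystack and the needle into words, and then use levenshtein to find the haystack words closest to each needle word.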
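As a minimal sketch of that word-by-word idea, the snippet below splits both strings into words and picks, for each needle word, the haystack word with the smallest Levenshtein distance. The helper name closest_word is hypothetical, and dividing the distance by the longer word's length is an assumption made here so that very short words do not win ties; it is not part of this answer's approach. ASCII-only input is assumed.

```php
<?php
// Hypothetical helper (an illustration, not code from this answer): returns
// the haystack word closest to $needleWord, or null if the list is empty.
function closest_word(string $needleWord, array $haystackWords): ?string
{
    $best = null;
    $bestScore = INF;
    foreach ($haystackWords as $candidate) {
        $a = strtolower($needleWord);
        $b = strtolower($candidate);
        // levenshtein() counts byte edits; normalising by the longer length
        // is an assumption of this sketch to avoid favouring short words.
        $score = levenshtein($a, $b) / max(strlen($a), strlen($b), 1);
        if ($score < $bestScore) {
            $bestScore = $score;
            $best = $candidate;
        }
    }
    return $best;
}

// Split on runs of non-word characters and drop empty pieces.
$haystackWords = preg_split('/\W+/', 'I have a unique idea. Do you need one?', -1, PREG_SPLIT_NO_EMPTY);
$needleWords   = preg_split('/\W+/', 'a unik idia', -1, PREG_SPLIT_NO_EMPTY);

$matches = [];
foreach ($needleWords as $word) {
    $matches[] = closest_word($word, $haystackWords);
}
echo implode(' ', $matches), PHP_EOL; // prints "a unique idea"
```

This keeps the needle's word order and allows the same haystack word to be chosen twice; whether that is acceptable depends on the use case.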
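Here is one way to achieve this. Please read the comments carefully to understand the idea; you should then be able to implement a better solution.
For strings that contain only ASCII characters this is relatively easy. For other encodings you may run into many difficulties, but a simple way to handle multibyte strings could look like this: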
function to_ascii($text, $encoding = "UTF-8") {
    if (is_string($text)) {
        // Includes combinations of characters that present as a single glyph
        $text = preg_replace_callback('/\X/u', __FUNCTION__, $text);
    }
    elseif (is_array($text) && count($text) == 1 && is_string($text[0])) {
        // IGNORE characters that can't be TRANSLITerated to ASCII
        $text = @iconv($encoding, "ASCII//IGNORE//TRANSLIT", $text[0]);
        // The documentation says that iconv() returns false on failure but it returns ''
        if ($text === '' || !is_string($text)) {
            $text = '?';
        }
        elseif (preg_match('/\w/', $text)) { // If the text contains any letters...
            $text = preg_replace('/\W+/', '', $text); // ...then remove all non-letters
        }
    }
    else { // $text was not a string
        $text = '';
    }
    return $text;
}
function find_similar($needle, $str, $keep_needle_order = false) {
    if (!is_string($needle) || !is_string($str)) {
        return false;
    }
    $valid = array();
    // get encodings and words from haystack and needle
    setlocale(LC_CTYPE, 'en_GB.UTF8');
    $encoding_s = mb_detect_encoding($str);
    $encoding_n = mb_detect_encoding($needle);
    mb_regex_encoding($encoding_n);
    $pneed = array_filter(mb_split('\W', $needle));
    mb_regex_encoding($encoding_s);
    $pstr = array_filter(mb_split('\W', $str));
    $tmp = $keys = array(); // work arrays, filled for each needle word
    foreach ($pneed as $k => $word) { // loop through the needle's words
        foreach ($pstr as $key => $w) {
            if ($encoding_n !== $encoding_s) {
                // if the encodings are not the same, do some transliteration
                $tmp_word = ($encoding_n !== 'ASCII') ? to_ascii($word, $encoding_n) : $word;
                $tmp_w = ($encoding_s !== 'ASCII') ? to_ascii($w, $encoding_s) : $w;
            } else {
                $tmp_word = $word;
                $tmp_w = $w;
            }
            $tmp[$tmp_w] = levenshtein($tmp_w, $tmp_word); // collect levenshtein distances
            $keys[$tmp_w] = array($key, $w);
        }
        $nominees = array_flip(array_keys($tmp, min($tmp))); // get the nominees
        $tmp = 10000;
        foreach ($nominees as $nominee => $idx) {
            // compare how the words sound to get more precision
            $idx = levenshtein(metaphone($nominee), metaphone($tmp_word));
            if ($idx < $tmp) {
                $tmp = $idx; // remember the best phonetic distance so far
                $answer = $nominee; // get the winner
            }
            unset($nominees[$nominee]);
        }
        if (!$keep_needle_order) {
            $valid[$keys[$answer][0]] = $keys[$answer][1]; // get the right form of the winner
        } else {
            $valid[$k] = $keys[$answer][1];
        }
        $tmp = $nominees = array(); // clean up a little for the next iteration
    }
    if (!$keep_needle_order) {
        ksort($valid);
    }
    $valid = array_values($valid); // get only the values
    /* return the array of the values closest to the
       needle, according to this algorithm of course */
    return $valid;
}
var_dump(find_similar('i knew you love me','finally i know you loved me and all my pets'));
var_dump(find_similar('I you love','This is a demo text and I love you about this'));
var_dump(find_similar('a unik idia','I have a unique idea. Do you need?'));
var_dump(find_similar("Goebel, Weiss, Goethe, Goethe und Goetz",'Weiß, Goldmann, Göbel, Weiss, Göthe, Goethe und Götz'));
var_dump(find_similar('Ḽơᶉëᶆ ȋṕšᶙṁ ḍỡḽǭᵳ ʂǐť ӓṁệẗ, ĉṓɲṩḙċťᶒțûɾ ấɖḯƥĭṩčįɳġ ḝłįʈ',
'Ḽơᶉëᶆ ȋṕšᶙṁ ḍỡḽǭᵳ ʂǐť ӓṁệẗ, ĉṓɲṩḙċťᶒțûɾ ấɖḯƥĭṩčįɳġ ḝłįʈ, șếᶑ ᶁⱺ ẽḭŭŝḿꝋď ṫĕᶆᶈṓɍ ỉñḉīḑȋᵭṵńť ṷŧ ḹẩḇőꝛế éȶ đꝍꞎôꝛȇ ᵯáꞡᶇā ąⱡîɋṹẵ.'));
The output is:
array(5) {
[0]=>
string(1) "i"
[1]=>
string(4) "know"
[2]=>
string(3) "you"
[3]=>
string(5) "loved"
[4]=>
string(2) "me"
}
array(3) {
[0]=>
string(1) "I"
[1]=>
string(4) "love"
[2]=>
string(3) "you"
}
array(3) {
[0]=>
string(1) "a"
[1]=>
string(6) "unique"
[2]=>
string(4) "idea"
}
array(5) {
[0]=>
string(6) "Göbel"
[1]=>
string(5) "Weiss"
[2]=>
string(6) "Goethe"
[3]=>
string(3) "und"
[4]=>
string(5) "Götz"
}
array(8) {
[0]=>
string(13) "Ḽơᶉëᶆ"
[1]=>
string(13) "ȋṕšᶙṁ"
[2]=>
string(14) "ḍỡḽǭᵳ"
[3]=>
string(6) "ʂǐť"
[4]=>
string(11) "ӓṁệẗ"
[5]=>
string(26) "ĉṓɲṩḙċťᶒțûɾ"
[6]=>
string(23) "ấɖḯƥĭṩčįɳġ"
[7]=>
string(9) "ḝłįʈ"
}
If you need the output as a string, you can call join on the result of the function before using it.
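For instance, assuming a result array shaped like the third output above, joining it into a string could look like the sketch below. The variable names are illustrative only.

```php
<?php
// Sample word list in the same shape as the array returned by find_similar().
$words = ['a', 'unique', 'idea'];

// join() is an alias of implode(); both glue the array items together.
$text = implode(' ', $words);
echo $text, PHP_EOL; // prints "a unique idea"
```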
You can run the working code and check the result online
But keep in mind that this will not work for every kind of string or every PHP version.
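One likely contributor to that caveat is that PHP's levenshtein() compares bytes, not characters, so a single accented letter encoded in UTF-8 can count as more than one edit. The sketch below contrasts it with a character-level variant. The function name mb_levenshtein_sketch is hypothetical, and the code assumes the mbstring extension, PHP 7.4 or later (for mb_str_split), and a source file saved as UTF-8.

```php
<?php
// Hypothetical helper (an illustration, not a built-in): Levenshtein distance
// counted in UTF-8 characters instead of bytes.
function mb_levenshtein_sketch(string $a, string $b): int
{
    $x = mb_str_split($a, 1, 'UTF-8');
    $y = mb_str_split($b, 1, 'UTF-8');
    // $previous[$j] is the distance between the characters of $a processed
    // so far and the first $j characters of $b.
    $previous = range(0, count($y));
    foreach ($x as $i => $charA) {
        $current = [$i + 1];
        foreach ($y as $j => $charB) {
            $cost = ($charA === $charB) ? 0 : 1;
            $current[] = min(
                $previous[$j + 1] + 1, // deletion
                $current[$j] + 1,      // insertion
                $previous[$j] + $cost  // substitution
            );
        }
        $previous = $current;
    }
    return $previous[count($y)];
}

echo levenshtein('Götz', 'Gotz'), PHP_EOL;           // 2: "ö" is two bytes in UTF-8
echo mb_levenshtein_sketch('Götz', 'Gotz'), PHP_EOL; // 1: one character differs
```

A character-level distance like this avoids the transliteration step for strings that are already in the same encoding, at the cost of running in plain PHP rather than in the built-in function.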
Answer 1 (score: 0)
Try this code to find a string within a string:
$data = "I have a unique idea. Do you need one?";
$find = "a unique idea";
$start = strpos($data, $find);
// strpos() returns false when nothing is found, and 0 for a match at the very start
if ($start !== false) {
    print_r(substr($data, $start, strlen($find)));
} else {
    echo "not found";
}
Answer 2 (score: 0)
Here is a very simple way to do it:
$source = "This is a demo text and I love you about this";
$needle = "I you love";
$words = explode(" ", $source);
$needleWords = explode(" ", $needle);
$results = [];
foreach ($needleWords as $key => $needleWord) {
    foreach ($words as $keyWords => $word) {
        if (strcasecmp($word, $needleWord) == 0) {
            $results[$keyWords] = $needleWord;
        }
    }
}
uksort($results, function ($a, $b) {
    return $a - $b;
});
echo(implode(" ", $results));
Output
I love you