Key phrase matching in strings

Time: 2013-09-17 04:55:19

Tags: php regex levenshtein-distance

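With levenshtein, "how are you", "hw r u", "how are u", and "hw ar you" can all be treated as the same phrase.

Is there any way I can achieve the same thing when I have phrases like the ones below?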

Phrase

  Hi, my name is john doe. I live in new york. What is your name?

Phrase

  My name is Bruce. Wht's your name

Key phrase

  What is your name

Response

  My name is Batman.

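I am getting input from the user, and I have a table containing a list of possible requests with their responses. For example, the user asks for a name. Is there a way to check whether a sentence contains a key phrase such as "What is your name" and, if it does, return the possible response?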

$phrase = ' hi, my name is john doe. I live in new york. What is your name?';

// I know this one will work
if (strpos($phrase, "What is your name") !== false) {
    return $response;
}

// but what if the user mistypes it
if (strpos($phrase, "Wht's your name") !== false) {
    return $response;
}

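Is there a way to achieve this? levenshtein only works well when the strings being compared are not long.

  Hi, wht's your name

  My name is Batman.

But if the input is as long as this

  Hi, my name is john doe. I live in new york. What is your name?

it does not work well. If the table also contains shorter phrases, a shorter phrase gets a smaller distance and the wrong answer is returned.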
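To make the problem concrete, here is a minimal sketch that compares the key phrase against a short and a long input with PHP's built-in levenshtein(). The variable names are only for illustration. The distance to the long input is dominated by the extra text around the key phrase, not by the typo.

```php
<?php
$key   = 'what is your name';
$short = "hi, wht's your name";
$long  = 'hi, my name is john doe. i live in new york. what is your name?';

// The whole-string distance grows with every character that is not part
// of the key phrase, even though $long contains the key phrase verbatim.
$distanceShort = levenshtein($key, $short);
$distanceLong  = levenshtein($key, $long);

echo "short: $distanceShort, long: $distanceLong\n";
```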

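I was thinking that another approach would be to check for certain key phrases. Any ideas on how to implement this?

I am doing something like the following, but I think there may be a better, more correct way.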

$samplePhrase = 'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend';

$keyPhrase = 'What is your name';
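  1. Get the first character of keyPhrase. That would be 'W'.
  2. Iterate over the characters of $samplePhrase and compare each one with the first character of keyPhrase:
  3. h, i, space, i, m, space, s, p, and so on.
  4. if keyPhrase.char = samplePhrase.currentChar
  5. get keyPhrase.length
  6. get the index of samplePhrase.currentChar
  7. take the substring of samplePhrase that starts at the currentChar index and has length keyPhrase.length
  8. The first substring it gets will be "work at krabby pa".
  9. Compare "work at krabby pa" with $keyPhrase ('What is your name') using the levenshtein distance,
  10. and check it further using similar_text.
  11. If they are not equal and the distance is large, repeat the process.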
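The steps above can be sketched roughly as follows. This is only an illustration of the idea: best_window() is a hypothetical helper name, the comparison is made case-insensitive as an assumption, and the similar_text() step is left out.

```php
<?php
// Hypothetical helper: slide a window of strlen($keyPhrase) characters over
// $samplePhrase and return the window with the smallest Levenshtein distance.
function best_window(string $samplePhrase, string $keyPhrase): array
{
    $sample = strtolower($samplePhrase);
    $key    = strtolower($keyPhrase);
    $len    = strlen($key);
    $best   = ['text' => '', 'distance' => PHP_INT_MAX];

    for ($i = 0; $i + $len <= strlen($sample); $i++) {
        // Only open a window where the first character matches the key.
        if ($sample[$i] !== $key[0]) {
            continue;
        }
        $window   = substr($sample, $i, $len);
        $distance = levenshtein($window, $key);
        if ($distance < $best['distance']) {
            $best = ['text' => $window, 'distance' => $distance];
        }
    }
    return $best;
}

$best = best_window(
    'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend',
    'What is your name'
);
echo $best['text'], ' (distance ', $best['distance'], ")\n";
```

Because the window has a fixed character length, it can cut words in half or pull in trailing words, which is one reason a word-based window may behave better.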

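4 Answers:

Answer 0: (score: 1)

My suggestion is to generate a list of n-grams from each input phrase and compute the edit distance between every n-gram and the key phrase.

Example: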

key phrase: "What is your name"
phrase 1: "hi, my name is john doe. I live in new york. What is your name?"
phrase 2: "My name is Bruce. wht's your name"

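The n-grams that could match are between 3 and 4 words long, so we create all 3-grams and 4-grams for each phrase. We should also normalize the strings by removing punctuation and lowercasing everything.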

phrase 1 3-grams:
"hi my name", "my name is", "name is john", "is john doe", "john doe i", "doe i live"... "what is your", "is your name"
phrase 1 4-grams:
"hi my name is", "my name is john", "name is john doe", "is john doe i"... "what is your name"

phrase 2 3-grams:
"my name is", "name is bruce", "is bruce wht's", "bruce wht's your", "wht's your name"
phrase 2 4-grams:
"my name is bruce", "name is bruce wht's", "is bruce wht's your", "bruce wht's your name"

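Next you can compute the levenshtein distance for each n-gram, which should solve the use case you describe above. If you need to normalize each word further, you can use a phonetic encoder such as Double Metaphone or NYSIIS. However, I ran a test with all the "common" phonetic encoders and in your case they did not show a significant improvement; phonetic encoders are better suited to names.

My experience with PHP is limited, but here is a code example: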

<?php
function extract_ngrams($phrase, $min_words, $max_words) {
    echo "Calculating N-Grams for phrase: $phrase\n";
    $ngrams = array();
    $words  = str_word_count(strtolower($phrase), 1);
    $word_count = count($words);

    for ($i = 0; $i <= $word_count - $min_words; $i++) {
        for ($j = $min_words; $j <= $max_words && ($j + $i) <= $word_count; $j++) {
            $ngrams[] = implode(' ',array_slice($words, $i, $j));
        }
    }
    return array_unique($ngrams);
}

function contains_key_phrase($ngrams, $key) {
    foreach ($ngrams as $ngram) {
        if (levenshtein($key, $ngram) < 5) {
            echo "found match: $ngram\n";
            return true;
        }
    }
    return false;
}

$key_phrase = "what is your name";
$phrases = array(
        "hi, my name is john doe. I live in new york. What is your name?",
        "My name is Bruce. wht's your name"
        );
$min_words = 3;
$max_words = 4;

foreach ($phrases as $phrase) {
    $ngrams = extract_ngrams($phrase, $min_words, $max_words);
    if (contains_key_phrase($ngrams,$key_phrase)) {
        echo "Phrase [$phrase] contains the key phrase [$key_phrase]\n";
    }
}
?>

The output looks like this:

Calculating N-Grams for phrase: hi, my name is john doe. I live in new york. What is your name?
found match: what is your name
Phrase [hi, my name is john doe. I live in new york. What is your name?] contains the key phrase [what is your name]
Calculating N-Grams for phrase: My name is Bruce. wht's your name
found match: wht's your name
Phrase [My name is Bruce. wht's your name] contains the key phrase [what is your name]
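The fixed threshold of 5 accepts the first n-gram that is close enough. A possible variant, sketched below, keeps the closest n-gram and accepts it only if its distance is small relative to the key length. The names word_ngrams() and best_ngram_match() and the 0.3 ratio are assumptions for illustration and would need tuning on real data.

```php
<?php
// Same n-gram extraction as above, without the debug output.
function word_ngrams(string $phrase, int $min_words, int $max_words): array
{
    $words  = str_word_count(strtolower($phrase), 1);
    $count  = count($words);
    $ngrams = [];
    for ($i = 0; $i <= $count - $min_words; $i++) {
        for ($n = $min_words; $n <= $max_words && $i + $n <= $count; $n++) {
            $ngrams[] = implode(' ', array_slice($words, $i, $n));
        }
    }
    return array_unique($ngrams);
}

// Hypothetical variant of contains_key_phrase(): return the closest n-gram,
// or null when even the closest one is too far from the key phrase.
function best_ngram_match(string $phrase, string $key, float $max_ratio = 0.3): ?array
{
    $best = null;
    foreach (word_ngrams($phrase, 3, 4) as $ngram) {
        $distance = levenshtein($key, $ngram);
        if ($best === null || $distance < $best['distance']) {
            $best = ['ngram' => $ngram, 'distance' => $distance];
        }
    }
    // The 0.3 ratio is an arbitrary assumption, not a recommended value.
    if ($best !== null && $best['distance'] <= $max_ratio * strlen($key)) {
        return $best;
    }
    return null;
}

$hit = best_ngram_match("My name is Bruce. wht's your name", 'what is your name');
```

Returning the distance together with the n-gram makes it possible to compare candidates across several key phrases and pick the overall best one.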

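Edit: I noticed some suggestions to add phonetic encoding for each word in the generated n-grams. I am not sure phonetic encoding is the best answer to this problem, since these encoders are mostly tuned to stem names (American, German, or French, depending on the algorithm) and are not very good at stemming plain words.

I actually wrote a test to validate this in Java (since the encoders are more readily available there). Here is the output: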

===========================
Created new phonetic matcher
    Engine: Caverphone2
    Key Phrase: what is your name
    Encoded Key Phrase: WT11111111 AS11111111 YA11111111 NM11111111
Found match: [What is your name?] Encoded: WT11111111 AS11111111 YA11111111 NM11111111
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
    Engine: DoubleMetaphone
    Key Phrase: what is your name
    Encoded Key Phrase: AT AS AR NM
Found match: [What is your] Encoded: AT AS AR
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: ATS AR NM
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
    Engine: Nysiis
    Key Phrase: what is your name
    Encoded Key Phrase: WAT I YAR NAN
Found match: [What is your name?] Encoded: WAT I YAR NAN
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: WT YAR NAN
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
    Engine: Soundex
    Key Phrase: what is your name
    Encoded Key Phrase: W300 I200 Y600 N500
Found match: [What is your name?] Encoded: W300 I200 Y600 N500
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
    Engine: RefinedSoundex
    Key Phrase: what is your name
    Encoded Key Phrase: W06 I03 Y09 N8080
Found match: [What is your name?] Encoded: W06 I03 Y09 N8080
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: W063 Y09 N8080
Phrase: [My name is Bruce. wht's your name] MATCH: true

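I used a levenshtein distance of 4 when running these tests, but I am pretty sure you can find multiple edge cases where a phonetic encoder fails to match correctly. Looking at the examples, you can see that, because of the stemming done by the encoders, you are actually more likely to get false positives when you use them this way. Keep in mind that these algorithms were originally intended to find people in a census who have the same name, not to decide which English words really "sound" the same.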

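Answer 1: (score: 1)

What you are trying to achieve is a quite complex natural language processing task that usually requires parsing, among other things.

What I am going to suggest is to create a sentence tokenizer that splits the phrase into sentences. Then tokenize each sentence by splitting on whitespace and punctuation, and probably also rewrite some abbreviations to a more normal form.

Then you can create custom logic that traverses the token list of each sentence looking for a specific meaning. For example, ['...', 'what', '...', '...', 'your', 'name', '...', '...', '?'] can also mean "what is your name". The sentence could be "So, what is your name really?" or "What could your name be?"

I am adding code as an example. I am not saying you should use something that simple. The code below uses NlpTools, a natural language processing library in PHP (I am involved in the library, so feel free to assume I am biased).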

 <?php

 include('vendor/autoload.php');

 use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
 use \NlpTools\Classifiers\Classifier;
 use \NlpTools\Tokenizers\WhitespaceTokenizer;
 use \NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer;
 use \NlpTools\Documents\Document;

 class EndOfSentence implements Classifier
 {
     public function classify(array $classes, Document $d)
     {
         list($token, $before, $after) = $d->getDocumentData();

         $lastchar = substr($token, -1);
         $dotcnt = count(explode('.',$token))-1;

         if (count($after)==0)
             return 'EOW';

         // for some abbreviations
         if ($dotcnt>1)
             return 'O';

         if (in_array($lastchar, array(".","?","!")))
             return 'EOW';

         // anything else is not the end of a sentence
         return 'O';
     }
 }

 function normalize($s) {
     // get this somewhere static
     $hash_table = array(
         'whats'=>'what is',
         'whts'=>'what is',
         'what\'s'=>'what is',
         '\'s'=>'is',
         'n\'t'=>'not',
         'ur'=>'your'
         // .... more ....
     );

     $s = mb_strtolower($s,'utf-8');
     if (isset($hash_table[$s]))
         return $hash_table[$s];
     return $s;
 }

 $whitespace_tok = new WhitespaceTokenizer();
 $punct_tok = new WhitespaceAndPunctuationTokenizer();
 $sentence_tok = new ClassifierBasedTokenizer(
     new EndOfSentence(),
     $whitespace_tok
 );

 $text = 'hi, my name is john doe. I live in new york. What\'s your name? whts ur name';

 foreach ($sentence_tok->tokenize($text) as $sentence) {
     $words = $whitespace_tok->tokenize($sentence);
     $words = array_map(
         'normalize',
         $words
     );
     $words = call_user_func_array(
         'array_merge',
         array_map(
             array($punct_tok,'tokenize'),
             $words
         )
     );

     // decide what this sequence of tokens is
     print_r($words);
 }
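One very simple way to "decide what this sequence of tokens is" would be to check whether a few pattern tokens occur in the sentence in order, with anything in between. The sketch below uses plain PHP only; matches_in_order() is a hypothetical helper and not part of NlpTools.

```php
<?php
// Hypothetical helper: true if every pattern token occurs in $tokens in the
// given order, with any number of other tokens in between.
function matches_in_order(array $tokens, array $pattern): bool
{
    $next = 0; // index of the next pattern token we are looking for
    foreach ($tokens as $token) {
        if ($next < count($pattern) && $token === $pattern[$next]) {
            $next++;
        }
    }
    return $next === count($pattern);
}

$tokens = ['so', ',', 'what', 'is', 'your', 'name', 'really', '?'];
$isNameQuestion = matches_in_order($tokens, ['what', 'your', 'name']);
```

A real system would need more than this, for instance to tell "what is your name" apart from "what is your dog's name".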

Answer 2: (score: 0)

First of all, fix all the short forms, for example wht's instead of what is:

$txt=$_POST['txt'];
$txt=str_ireplace("hw r u","how are You",$txt);
// A space before and after the short form is required; otherwise every occurrence of "hw" is replaced, even inside a word.
$txt=str_ireplace(" hw "," how ",$txt);
$txt=str_ireplace(" r "," are ",$txt);
$txt=str_ireplace(" u "," you ",$txt);
$txt=str_ireplace(" wht's "," What is ",$txt);

Add as many short forms as you like in the same way. Now just check this text for all the possible questions and get their positions:

if (strpos($txt,"What is your name") !== false) {// "!== false" is needed: strpos returns 0 when the match is at the start of the string
    return $response;
}
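The space-padded replacements miss short forms at the very start or end of the input and next to punctuation. A possible alternative, sketched here, uses preg_replace() with \b word boundaries. expand_shorthand() and the small replacement map are assumptions for illustration.

```php
<?php
// Hypothetical helper: expand short forms using word boundaries, so that
// "hw" is replaced as a word but not inside a word such as "highway".
function expand_shorthand(string $txt): string
{
    $map = [
        '/\bhw\b/i'       => 'how',
        '/\br\b/i'        => 'are',
        '/\bu\b/i'        => 'you',
        "/\\bwht'?s\\b/i" => 'what is',
    ];
    // With array arguments, preg_replace applies the patterns one after another.
    return preg_replace(array_keys($map), array_values($map), $txt);
}

$txt = expand_shorthand('Whts your name my friend');

// stripos is case-insensitive; compare with false because the match
// is at position 0 here.
$found = stripos($txt, 'what is your name') !== false;
```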

Answer 3: (score: 0)

You could consider using the soundex function to convert the input string into its phonetic equivalent and then continue with the search.
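A minimal sketch of that idea, using PHP's built-in soundex() on each word: soundex_words() and contains_soundex_phrase() are hypothetical helper names, and the approach only tolerates typos that leave the Soundex code of every word unchanged.

```php
<?php
// Hypothetical helper: encode every word of a phrase with soundex().
function soundex_words(string $phrase): array
{
    $words = str_word_count(strtolower($phrase), 1);
    return array_map('soundex', $words);
}

// True if the encoded key words appear as a contiguous run of words
// in the encoded phrase.
function contains_soundex_phrase(string $phrase, string $key): bool
{
    $haystack = ' ' . implode(' ', soundex_words($phrase)) . ' ';
    $needle   = ' ' . implode(' ', soundex_words($key)) . ' ';
    return strpos($haystack, $needle) !== false;
}

$found = contains_soundex_phrase('hi, wat is yur name?', 'What is your name');
```

A contraction such as "wht's" collapses two words into one code, so this sketch would not match it without an extra normalization step.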