Question

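I want to parse a sentence into words, but some sentences contain two words that can be combined into a single term with a different meaning.

For example:

Eminem is a hip hop star.

If I parse it by splitting on spaces, I get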

Eminem is a **hip** **hop** star

But I want something like this:

Eminem is a **hip hop** star

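This is just one example; there may be other word combinations that are listed as a single entry in the dictionary.

How can I parse this easily?

I have a dictionary in a MySQL database. Is there an API that can do this?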

Answer 1

I don't know of an API. However, you could try the SQL LIKE clause.

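// Note: the mysql_* functions were removed in PHP 7; use PDO or mysqli on current versions.
$words = explode(' ', 'Eminem is a hip hop star');
$len = count($words);

$fixed = array();

for ($x = 0; $x < $len; $x++) {
    $matched = false;

    // Only look for a combination if there is a next word
    if ($x + 1 < $len) {
        // Combine current and next word
        $combined = $words[$x].' '.$words[$x + 1];

        // LIKE 'hip %' will match 'hip hop'
        // The input is not escaped here; sanitize it in real code
        $q = mysql_query("SELECT word FROM dict WHERE word LIKE '".$words[$x]." %'");

        while ($q && ($result = mysql_fetch_array($q))) {
            if ($result['word'] == $combined) {  // Combination is in the dictionary
                $matched = true;
                break;
            }
        }
    }

    if ($matched) {
        $fixed[] = $combined;
        $x++;  // Skip the next word; it is now part of the combination
    } else {   // No combination found; keep the single word
        $fixed[] = $words[$x];
    }
}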

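*Pardon my lack of PDO. I'm feeling lazy right now.

Edit: I have been doing some thinking. While the code above is not optimal, the optimized version I came up with probably can't do much better. The fact is that however you approach the problem, you will need to compare every word in the input sentence against the dictionary and perform additional computations. Depending on your hardware constraints, I can see two approaches.

Both approaches assume that the dict table has the following (example) structure: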

+--+-----+------+
|id|first|second|
+--+-----+------+
|01|hip  |hop   |
+--+-----+------+
|02|grade|school|
+--+-----+------+

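Option 1: Your web server has lots of available memory (and a decent processor)

The idea here is to bypass the database layer entirely by caching the dictionary in PHP's memory (using APC or memcache, the latter if you plan to run on multiple servers). This puts all of the load on your web server, but it is likely to be considerably faster, since reading cached data from RAM is much faster than querying the database.

(Again, I have left out PDO and sanitization for simplicity.)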

// Step One: Cache the dictionary (the entire dictionary).
//           This could run on server start-up or before handling user input.
if (!apc_exists('words')) {
    $words = array();

    $q = mysql_query('SELECT first, second FROM dict');
    // mysql_fetch_row returns a numerically indexed array: [first, second]
    while ($res = mysql_fetch_row($q)) {
        $words[] = $res;
    }

    apc_store('words', $words); // Arrays can be stored directly; you could use memcache instead
}


// Step Two: Compare the cached dictionary to the user input
$data = explode(' ', 'Eminem is a hip hop star');
$words = apc_fetch('words');

$count = count($data);
for ($x = 0; $x < $count - 1; $x++) { // Stop one short: the last word has no successor
    foreach ($words as $word) { // Match against each dictionary entry
        if ($data[$x] == $word[0] && $data[$x + 1] == $word[1]) {
            $data[$x] .= ' '.$word[1];       // Fuse the pair into one element
            array_splice($data, $x + 1, 1);  // Remove the now-redundant second word
            $count--;
            break;                           // At most one merge per position
        }
    }
}
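
The nested loop above compares every input word against every dictionary entry. A minimal sketch of one possible alternative follows, assuming the cached pairs are available as a plain PHP array: index the pairs by "first second" so that each position in the sentence needs only one lookup. The literal $pairs array is a stand-in for the cached dictionary, and the names $index and $fixed are chosen here for illustration.

```php
<?php
// Stand-in for the cached dictionary: each entry is [first, second].
// In the setup above, this array would come from the cache instead.
$pairs = [['hip', 'hop'], ['grade', 'school']];

// Index the pairs by "first second" so each lookup is a single hash probe.
$index = [];
foreach ($pairs as $pair) {
    $index[$pair[0] . ' ' . $pair[1]] = true;
}

$data = explode(' ', 'Eminem is a hip hop star');
$fixed = [];
$count = count($data);

for ($x = 0; $x < $count; $x++) {
    $hasNext = $x + 1 < $count;
    if ($hasNext && isset($index[$data[$x] . ' ' . $data[$x + 1]])) {
        // The current and next word form a known pair: emit them as one entry
        $fixed[] = $data[$x] . ' ' . $data[$x + 1];
        $x++; // The next word was consumed by the merge
    } else {
        $fixed[] = $data[$x];
    }
}

echo implode(' | ', $fixed), PHP_EOL;
```

This builds a new array rather than splicing the input in place, which keeps the loop index simple. Whether it is actually faster than the nested loop depends on the dictionary size.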

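Option 2: Fast SQL server

The second option involves querying the SQL server for every pair of adjacent words in the input text. For example, for the sentence "Eminem is hip hop", you would create a query that looks like SELECT * FROM dict WHERE (first = 'Eminem' && second = 'is') || (first = 'is' && second = 'hip') || (first = 'hip' && second = 'hop'). Then, to fix the word array, you simply loop through MySQL's results and fuse the corresponding words together. If you are willing to take this route, it may be more efficient to cache commonly used words and fix them before querying the database. That way you can eliminate conditions from the query.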
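
The second option can be sketched roughly as follows. This is only an illustration under stated assumptions: buildPairQuery and fuseRows are hypothetical helper names, the table and column names follow the example dict structure, and the database call itself is shown only as a comment (a PDO connection in $pdo is assumed), with a hard-coded $rows array standing in for the result set.

```php
<?php
// Hypothetical helper: builds one parameterized query covering every adjacent word pair.
function buildPairQuery(array $words): array
{
    $conditions = [];
    $params = [];
    for ($i = 0; $i < count($words) - 1; $i++) {
        $conditions[] = '(first = ? AND second = ?)';
        $params[] = $words[$i];
        $params[] = $words[$i + 1];
    }
    if (!$conditions) {
        return [null, []]; // Fewer than two words: nothing to look up
    }
    $sql = 'SELECT first, second FROM dict WHERE ' . implode(' OR ', $conditions);
    return [$sql, $params];
}

// Hypothetical helper: loops over the result rows and fuses each matching pair.
function fuseRows(array $words, array $rows): array
{
    foreach ($rows as [$first, $second]) {
        for ($i = 0; $i < count($words) - 1; $i++) {
            if ($words[$i] === $first && $words[$i + 1] === $second) {
                // Replace the two separate words with one combined entry
                array_splice($words, $i, 2, [$first . ' ' . $second]);
            }
        }
    }
    return $words;
}

$words = explode(' ', 'Eminem is hip hop');
[$sql, $params] = buildPairQuery($words);

// With a PDO connection in $pdo (assumed, not shown), the rows would come from:
//   $stmt = $pdo->prepare($sql);
//   $stmt->execute($params);
//   $rows = $stmt->fetchAll(PDO::FETCH_NUM);
$rows = [['hip', 'hop']]; // Stand-in for the database result

$fixed = fuseRows($words, $rows);
```

Using placeholders instead of concatenating the words into the SQL string avoids the escaping problem, at the cost of two bound parameters per word pair.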

How to parse words/phrases of 2 words using a dictionary database (in PHP)
