Question

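I have two lists of words, say LIST1 and LIST2. I want to compare LIST1 against LIST2 to find duplicates, but the comparison should also find the plural and the -ing form of each word. For example:

Suppose LIST1 has the word "account" and LIST2 has the words "accounts, accounting". When I do the comparison, the result should show two matches for the word "account".

I am doing this in PHP, and the lists are stored in MySQL tables.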

Answer 1

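You can use a technique called Porter stemming to map each list entry to its stem, and then compare the stems. Implementations of the Porter stemming algorithm in PHP can be found here or here.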
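To make the idea concrete, here is a minimal sketch of comparing two lists by stem. The function names `naiveStem` and `findStemMatches` are hypothetical, and `naiveStem` is only a toy suffix stripper standing in for a real Porter stemmer so that the example runs on its own; in practice you would pass in the stemmer from whichever implementation you pick.

```php
<?php
// Toy stand-in for a real Porter stemmer (an assumption for this sketch,
// NOT the Porter algorithm): strips one common suffix from the word.
function naiveStem(string $word): string
{
    $word = strtolower($word);
    foreach (['ing', 'es', 's'] as $suffix) {
        $len = strlen($suffix);
        // Keep at least three characters so short words are left alone.
        if (strlen($word) - $len >= 3 && substr($word, -$len) === $suffix) {
            return substr($word, 0, -$len);
        }
    }
    return $word;
}

// Index LIST2 by stem, then look up the stem of every LIST1 word.
function findStemMatches(array $list1, array $list2, callable $stem): array
{
    $byStem = [];
    foreach ($list2 as $word) {
        $byStem[$stem($word)][] = $word;
    }

    $matches = [];
    foreach ($list1 as $word) {
        $matches[$word] = $byStem[$stem($word)] ?? [];
    }
    return $matches;
}

$result = findStemMatches(['account'], ['accounts', 'accounting', 'table'], 'naiveStem');
print_r($result);
```

Because the stemmer is passed in as a callable, swapping the toy function for a proper Porter implementation should not require touching the comparison logic.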

Answer 2

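What I would do is take your word and compare it directly against LIST2, and at the same time remove your word from each word you are comparing against, looking for a leftover s, es, or ing that marks a plural or an -ing form (this should be accurate enough). If it is not, you will have to write an algorithm that builds plurals from words, since it is not as simple as adding an S.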

Duplicate Ending List
s
es
ing

LIST1
Gas
Test

LIST2
Gases
Tests
Testing

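Now compare LIST1 against LIST2. In the same comparison loop, do a direct comparison of the items, and also one where the LIST1 word is removed from the start of the current LIST2 word. Then just check whether the leftover is in the Duplicate Ending List.

Hope this makes sense.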
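The loop described above could be sketched roughly like this. The function name `findEndingMatches` is made up for illustration, and the comparison is case-insensitive by assumption; the endings and sample words are the ones listed in this answer.

```php
<?php
// Hypothetical helper: for each LIST1 word, collect LIST2 words that are
// either identical or equal to the LIST1 word plus one known ending.
function findEndingMatches(array $list1, array $list2, array $endings = ['s', 'es', 'ing']): array
{
    $matches = [];
    foreach ($list1 as $base) {
        $baseLower = strtolower($base);
        $baseLen   = strlen($baseLower);

        foreach ($list2 as $candidate) {
            $candLower = strtolower($candidate);

            // Direct comparison.
            if ($candLower === $baseLower) {
                $matches[$base][] = $candidate;
                continue;
            }

            // Remove the LIST1 word from the front and inspect the leftover.
            if (strncmp($candLower, $baseLower, $baseLen) === 0) {
                $leftover = substr($candLower, $baseLen);
                if (in_array($leftover, $endings, true)) {
                    $matches[$base][] = $candidate;
                }
            }
        }
    }
    return $matches;
}

$found = findEndingMatches(['Gas', 'Test'], ['Gases', 'Tests', 'Testing']);
print_r($found);
```

This only catches forms that are built by appending an ending, so pairs such as "city" and "cities" would still be missed.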

Answer 3

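The problem is that, at least in English, plurals are not all formed with a standard suffix, and neither are present participles. You can approximate by matching each word + 'ing' and word + 's', but that will produce both false positives and false negatives.

If you like, you can handle this directly in MySQL.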

SELECT DISTINCT l2.word
  FROM LIST1 l1
  JOIN LIST2 l2
    ON l2.word = l1.word
    OR l2.word = CONCAT(l1.word, 's')
    OR l2.word = CONCAT(l1.word, 'ing');

Answer 4

This function will output the plural of a word.

http://www.exorithm.com/algorithm/view/pluralize

Something similar could be written for gerunds and present participles (the -ing form).
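As a rough illustration of what such functions might look like, here is a deliberately simplified sketch. It is not the code behind the link above; the names `simplePluralize` and `simpleIngForm` are hypothetical, and the rules cover only a few common English patterns.

```php
<?php
// Simplified pluralizer (assumption: only three rules, no irregular nouns).
function simplePluralize(string $word): string
{
    // gas -> gases, box -> boxes, church -> churches
    if (preg_match('/(s|x|z|ch|sh)$/i', $word)) {
        return $word . 'es';
    }
    // city -> cities (consonant followed by y)
    if (preg_match('/[^aeiou]y$/i', $word)) {
        return substr($word, 0, -1) . 'ies';
    }
    return $word . 's';
}

// Simplified -ing form (assumption: no consonant doubling, so "run"
// and similar verbs are NOT handled correctly).
function simpleIngForm(string $word): string
{
    // make -> making, but see -> seeing
    if (preg_match('/[^e]e$/i', $word)) {
        return substr($word, 0, -1) . 'ing';
    }
    return $word . 'ing';
}

echo simplePluralize('account'), ' ', simpleIngForm('account'), PHP_EOL;
```

You could generate these forms for every LIST1 word and then look them up in LIST2 with an exact match, at the cost of missing anything the rules do not cover.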

Answer 5

You might consider using the Doctrine Inflector class in combination with a stemmer.

Here is the algorithm at a high level:

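Split the search string on spaces and process the words individually
Lowercase the search word
Strip special characters
Singularize, replacing the differing portion with a wildcard ('%')
Stem, replacing the differing portion with a wildcard ('%')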

Here is the function I put together:

/**
 * Use inflection and stemming to produce a good search string to match subtle
 * differences in a MySQL table.
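 *
 * Assumes Inflector is the Doctrine Inflector class mentioned above and that
 * stem_english() is provided by a stemming extension (for example PECL stem).
 *
 * @param string $sInputString The string you want to base the search on
 * @param string $sSearchTable The table you want to search in
 * @param string $sSearchField The field you want to search
 * @return string The SELECT statement with a LIKE pattern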
 */
function getMySqlSearchQuery($sInputString, $sSearchTable, $sSearchField)
{
    $aInput  = explode(' ', strtolower($sInputString));
    $aSearch = [];
    foreach($aInput as $sInput) {
        $sInput = str_replace("'", '', $sInput);

        //--------------------
        // Inflect
        //--------------------
        $sInflected = Inflector::singularize($sInput);

        // Replace the part of the input that differs from its singular form
        // with a % (wildcard) for the MySQL query. XOR-ing the two strings yields
        // "\0" wherever they agree, so strspn() returns the common prefix length.
        $iPosition = strspn($sInput ^ $sInflected, "\0");

        if ($iPosition < strlen($sInput)) {
            $sInput = substr($sInflected, 0, $iPosition) . '%';
        }

        //--------------------
        // Stem
        //--------------------
        $sStemmed = stem_english($sInput);

        // Replace the part of the string that differs from its stem
        // with a % (wildcard) for the MySQL query
        $iPosition = strspn($sInput ^ $sStemmed, "\0");

        if ($iPosition < strlen($sInput)) {
            $aSearch[] = substr($sStemmed, 0, $iPosition) . '%';
        } else {
            $aSearch[] = $sInput;
        }
    }

    $sSearch = implode(' ', $aSearch);
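    // NOTE: the values are interpolated directly into the SQL string; escape
    // them or use a bound parameter before running this on untrusted input.
    return "SELECT * FROM $sSearchTable WHERE LOWER($sSearchField) LIKE '$sSearch';";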
}

I ran it with several test strings:

Input String: Mary's Hamburgers
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'mary% hamburger%';

Input String: Office Supplies
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'offic% suppl%';

Input String: Accounting department
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'account% depart%';

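It may not be perfect, but it is a good start anyway! Where it falls down is when multiple matches are returned: there is no logic to determine the best match. That is where things like MySQL fulltext and Lucene come in. Thinking about it some more, you could use levenshtein to rank multiple results with this approach!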
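A minimal sketch of that ranking idea, using PHP's built-in levenshtein() function. The helper name `rankByLevenshtein` and the sample candidate strings are invented for illustration; in practice the candidates would be the rows returned by the LIKE query.

```php
<?php
// Hypothetical helper: order candidate rows by edit distance to the
// original input, closest first (case-insensitive by assumption).
function rankByLevenshtein(string $input, array $candidates): array
{
    $needle = strtolower($input);
    usort($candidates, function (string $a, string $b) use ($needle): int {
        return levenshtein($needle, strtolower($a)) <=> levenshtein($needle, strtolower($b));
    });
    return $candidates;
}

$ranked = rankByLevenshtein(
    'Office Supplies',
    ['Office Supply Store', 'Office Supplies', 'Official Supplements']
);
print_r($ranked);
```

An exact match has distance 0 and so always sorts first; how well the ordering of the remaining rows reflects relevance depends on the data.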

Compare words, also need to look for plurals and ing?