Finding synonyms by Levenshtein distance

Time: 2016-06-13 12:05:43

Tags: c# levenshtein-distance synonym

Here is my code:

public void SearchWordSynonymsByLevenstein()
{
    foreach (var eachWord in wordCounter)
    {
        foreach (var eachSecondWord in wordCounter)
        {
            if (eachWord.Key.Length > 3)
            {
                var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
                if (score < 2)
                {
                    if(!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key)))
                    {
                        if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
                        {
                            wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
                        }
                        else
                        {
                            wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
                        }
                    }
                }
            }
        }
    }
}

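My wordCounter is a Dictionary<string, int>, where the key is each word and the value is the number of times that word occurs in the document, something like a bag of words. For every eachWord I have to search the other entries (eachSecondWord) for synonyms. This method takes too much time, and the time grows exponentially. Is there another way to reduce it?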

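1 answer:

Answer 0 (score: 1)

First, I assume you don't want to associate a word with itself in the wordSynonymsByLevenstein collection. Second, by comparing the lengths of the two words you can skip pairs that you already know cannot satisfy your score < 2 requirement: if the lengths differ by 2 or more, the distance is at least 2.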

public void SearchWordSynonymsByLevenstein()
{
    foreach (var eachWord in wordCounter)
    {
        foreach (var eachSecondWord in wordCounter)
        {
            if (eachWord.Key == eachSecondWord.Key 
                || eachWord.Key.Length <= 3 
                || Math.Abs(eachWord.Key.Length - eachSecondWord.Key.Length) >= 2)
            {
                continue;
            }
            var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
            if (score >= 2)
            {
                continue;
            }

            if(!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key)))
            {
                if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
                {
                    wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
                }
                else
                {
                    wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
                }
            }

        }
    }
}
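The length pre-check above works because each insertion or deletion changes a string's length by one, so the Levenshtein distance can never be smaller than the difference in lengths. The following is a minimal sketch of that bound. The asker's LevenshteinDistance.Compute is not shown, so the Compute method here is a textbook two-row implementation used as a stand-in, and the class name and sample words are made up for illustration.

```csharp
using System;

public static class LengthBoundDemo
{
    // Textbook two-row dynamic-programming Levenshtein distance.
    // Stand-in for the asker's LevenshteinDistance.Compute, which is not shown.
    public static int Compute(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // The row just filled becomes the "previous" row for the next character.
            var swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }

    public static void Main()
    {
        string[][] pairs =
        {
            new[] { "word", "words" },   // lengths differ by 1
            new[] { "word", "wordy!" },  // lengths differ by 2
            new[] { "color", "colour" }, // lengths differ by 1
        };
        foreach (var p in pairs)
        {
            int lengthGap = Math.Abs(p[0].Length - p[1].Length);
            int distance = Compute(p[0], p[1]);
            // distance is never smaller than lengthGap, so a gap of 2 or more
            // already rules out a score below 2 without running Compute.
            Console.WriteLine($"{p[0]} / {p[1]}: gap={lengthGap}, distance={distance}");
        }
    }
}
```

Comparing two lengths is a constant-time operation, while the distance computation is proportional to the product of the two word lengths, which is why filtering on length first saves work even though the nested loop still visits every pair.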

The requirement you express with if(!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key))) is not particularly obvious or straightforward, but if you don't want a word to be associated with more than one other word, you could also add a HashSet<string>: as you associate words, add them to the HashSet and check whether the next word is already in there before proceeding, rather than iterating the nested dictionaries.

public void SearchWordSynonymsByLevenstein()
{
    var used = new HashSet<string>();
    foreach (var eachWord in wordCounter)
    {
        foreach (var eachSecondWord in wordCounter)
        {
            if (eachWord.Key == eachSecondWord.Key
                || eachWord.Key.Length <= 3
                || Math.Abs(eachWord.Key.Length - eachSecondWord.Key.Length) >= 2)
            {
                continue;
            }
            var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
            if (score >= 2)
            {
                continue;
            }

            if (used.Add(eachSecondWord.Key))
            {
                if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
                {
                    wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
                }
                else
                {
                    wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
                }
            }
        }
    }
}

I used if (used.Add(eachSecondWord.Key)) here because Add returns true if the word was added and false if it was already in the HashSet.
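To illustrate the return value of HashSet<T>.Add that the answer relies on, here is a small standalone sketch. The class name and the sample word are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Collections.Generic;

public static class HashSetAddDemo
{
    // Returns the results of adding the same word twice, plus the final count.
    public static (bool First, bool Second, int Count) AddTwice(string word)
    {
        var used = new HashSet<string>();
        bool first = used.Add(word);   // not present yet, so it is inserted
        bool second = used.Add(word);  // already present, the set is unchanged
        return (first, second, used.Count);
    }

    public static void Main()
    {
        var result = AddTwice("colour");
        // Prints: True False 1
        Console.WriteLine($"{result.First} {result.Second} {result.Count}");
    }
}
```

Because Add both tests membership and inserts in a single hashed lookup, it replaces the Any(...) scan over every nested dictionary, which has to walk all existing entries each time it is called.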