Problem computing cosine similarity

Asked: 2015-08-27 18:00:58

Tags: c# cosine-similarity

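I have a problem computing the cosine similarity between 2 strings. I use a function to compute a binary vector for each string; it produces binary vectors such as (1,1,1,1,1,0,0,0,0).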

public static Tuple<int[],int[]> sentence_to_vector(String[] word_array1, String[] word_array2)
{
    String[] unique_word_array1 = word_array1.Distinct().ToArray();
    String[] unique_word_array2 = word_array2.Distinct().ToArray();
    String[] list_all_words = unique_word_array1.Concat(unique_word_array2).ToArray();
    String[] list_all_words_unique = list_all_words.Distinct().ToArray();

    int count_all_unique_words = list_all_words_unique.Length;
    int[] sentence1_vector = new int[count_all_unique_words];
    int[] sentence2_vector = new int[count_all_unique_words];

    for (int i = 0; i < count_all_unique_words; i++)
    {
        if (Array.IndexOf(unique_word_array1, list_all_words_unique[i]) >= 0)
        {
            sentence1_vector[i] = 1;
        }
        else
        {
            sentence1_vector[i] = 0;
        }
    }

    for (int i = 0; i < count_all_unique_words; i++)
    {
        if (Array.IndexOf(unique_word_array2, list_all_words_unique[i]) >= 0)
        {
            sentence2_vector[i] = 1;
        }
        else
        {
            sentence2_vector[i] = 0;
        }
    } 

    return Tuple.Create(sentence1_vector, sentence2_vector);

}
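For reference, the same vector construction can be sketched more compactly with LINQ. This is only an illustrative equivalent, not the code from the question: the class name `VectorSketch` and the method name `ToBinaryVectors` are hypothetical, and the sample words are made up.

```csharp
using System;
using System.Linq;

public static class VectorSketch
{
    // Hypothetical helper: builds binary presence vectors over the
    // combined vocabulary of both word arrays.
    public static Tuple<int[], int[]> ToBinaryVectors(string[] words1, string[] words2)
    {
        // Union removes duplicates and keeps first-seen order
        // (words1 first, then new words from words2).
        string[] vocabulary = words1.Union(words2).ToArray();

        // 1 if the vocabulary word occurs in the sentence, otherwise 0.
        int[] vector1 = vocabulary.Select(w => words1.Contains(w) ? 1 : 0).ToArray();
        int[] vector2 = vocabulary.Select(w => words2.Contains(w) ? 1 : 0).ToArray();

        return Tuple.Create(vector1, vector2);
    }

    public static void Main()
    {
        var vectors = ToBinaryVectors(
            new[] { "the", "cat", "sat" },
            new[] { "the", "dog" });

        // Vocabulary order is: the, cat, sat, dog
        Console.WriteLine(string.Join(",", vectors.Item1)); // 1,1,1,0
        Console.WriteLine(string.Join(",", vectors.Item2)); // 1,0,0,1
    }
}
```

Both vectors always have the same length, one slot per distinct word across the two inputs, so they can be passed directly to a dot-product based similarity function.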

After computing the vector representations, I calculate the cosine similarity.

The code is attached here:

public static float get_cosine_similarity(int[] sentence1_vector, int[] sentence2_vector)
{
    int vector_length = sentence1_vector.Length;
    int i = 0;
    float numerator = 0, denominator = 0;
    int temp1 = 0, temp2 = 0;
    double square_root1 = 0, square_root2 = 0;

    for (i = 0; i < vector_length; i++)
    {
        numerator += sentence1_vector[i] * sentence2_vector[i];
        temp1 += sentence1_vector[i] * sentence1_vector[i];
        temp2 += sentence2_vector[i] * sentence2_vector[i];
    }

    //TextWriter tw = new StreamWriter("E://testpdf/date2.txt");
    square_root1 = Math.Sqrt(temp1);
    square_root2 = Math.Sqrt(temp2);
    denominator = (float)(square_root1 * square_root2);

    if (denominator != 0){
        return (float)(numerator / denominator);
        //return (float)(numerator);
    }
    else{
        return 0;
    }
}
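Written out, the function above computes the standard cosine formula. For binary presence vectors it reduces to a count of shared words, which makes the expected result easy to check by hand. The symbols $a_i$, $b_i$, $W_1$ and $W_2$ are notation introduced here for illustration ($W_1$, $W_2$ are the sets of distinct words in each sentence); they do not appear in the original code.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2}\,\sqrt{\sum_{i} b_i^2}}
  = \frac{|W_1 \cap W_2|}{\sqrt{|W_1|\,|W_2|}}
  \qquad \text{for } a_i, b_i \in \{0, 1\}

% Worked example: W_1 = {the, cat, sat}, W_2 = {the, dog}
\cos = \frac{1}{\sqrt{3 \cdot 2}} \approx 0.408
```

Under this formula the result is 0 exactly when the two word arrays have no element in common, so any non-zero result means at least one token (possibly an empty string or a punctuation mark) appears in both arrays.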

I checked against a website where I can enter 2 strings and get the cosine similarity between them. The site is linked here:

http://cs.uef.fi/~zhao/Link/Similarity_strings.html

function implementationCosin(){
    var string1 = document.DPAform.str1.value;
    var s1 = stringBlankCheck(string1);

    var string2 = document.DPAform.str2.value;
    var s2 = stringBlankCheck(string2);

    if (s1.length < 1) {
        alert("Please input the string1.");
        return;
    }
    if (s2.length < 1) {
        alert("Please input the string2.");
        return;
    }

    document.DPAform.displayArea2.value = "";

    var sDT = new Date();
   // var begin = new Date().getTime();

    var cosin_similarity_value = consinSimilarity(s1, s2);
    document.DPAform.displayArea2.value += 'Cosin_Similarity(' + s1 + ',' + s2 + ')=' + cosin_similarity_value + '%\n';

    var eDT = new Date();
      var timediff = sDT.dateDiff("ms", eDT);
   // var timediff = (new Date().getTime() - begin);

    document.DPAform.displayArea2.value += "The total escaped time is: " + timediff + " (ms).\n";
}

Even when 2 sentences are 0% similar, my code reports some similarity between them.
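The question does not show how the strings are split into word arrays, and the body of `consinSimilarity` on the site is not shown either, so the cause cannot be confirmed from the code above. One thing worth ruling out is a token that ends up in both arrays, for example the empty string that `Split(' ')` produces for repeated spaces. The sketch below is only a diagnostic under that assumption; `TokenCheckSketch`, `Tokenize` and `SetCosine` are hypothetical names, and the sample sentences are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TokenCheckSketch
{
    // Hypothetical tokenizer: lower-cases the sentence and drops the empty
    // tokens that a plain Split(' ') yields for repeated whitespace.
    public static string[] Tokenize(string sentence)
    {
        return sentence.ToLowerInvariant()
                       .Split(new[] { ' ', '\t', '\r', '\n' },
                              StringSplitOptions.RemoveEmptyEntries);
    }

    // Cosine similarity of two binary presence vectors, computed on word sets:
    // shared words / sqrt(|set1| * |set2|).
    public static double SetCosine(IEnumerable<string> words1, IEnumerable<string> words2)
    {
        var set1 = new HashSet<string>(words1);
        var set2 = new HashSet<string>(words2);
        if (set1.Count == 0 || set2.Count == 0)
        {
            return 0.0; // avoid dividing by zero for empty input
        }

        int shared = set1.Intersect(set2).Count();
        return shared / Math.Sqrt((double)set1.Count * set2.Count);
    }

    public static void Main()
    {
        string a = "hello  world"; // note the two spaces
        string b = "foo  bar";     // note the two spaces

        // Plain split: both arrays contain "", which counts as a shared word.
        Console.WriteLine(SetCosine(a.Split(' '), b.Split(' '))); // roughly 0.33

        // Filtered tokens: nothing in common.
        Console.WriteLine(SetCosine(Tokenize(a), Tokenize(b))); // 0
    }
}
```

If printing the two word arrays shows no shared token and the result is still non-zero, the difference is more likely in how the site defines its similarity than in the C# code.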

0 answers:

No answers