我在计算2个字符串之间的余弦相似性方面存在问题。
我使用函数计算每个字符串的二进制向量格式。它给出了二进制向量,例如(1,1,1,1,1,0,0,0,0)
。
public static Tuple<int[],int[]> sentence_to_vector(String[] word_array1, String[] word_array2)
{
String[] unique_word_array1 = word_array1.Distinct().ToArray();
String[] unique_word_array2 = word_array2.Distinct().ToArray();
String[] list_all_words = unique_word_array1.Concat(unique_word_array2).ToArray();
String[] list_all_words_unique = list_all_words.Distinct().ToArray();
int count_all_unique_words = list_all_words_unique.Length;
int[] sentence1_vector = new int[count_all_unique_words];
int[] sentence2_vector = new int[count_all_unique_words];
for (int i = 0; i < count_all_unique_words; i++)
{
if (Array.IndexOf(unique_word_array1, list_all_words_unique[i]) >= 0)
{
sentence1_vector[i] = 1;
}
else
{
sentence1_vector[i] = 0;
}
}
for (int i = 0; i < count_all_unique_words; i++)
{
if (Array.IndexOf(word_array2, list_all_words_unique[i]) >= 0)
{
sentence2_vector[i] = 1;
}
else
{
sentence2_vector[i] = 0;
}
}
return Tuple.Create(sentence1_vector, sentence2_vector);;
}
在计算矢量表示后,我会进行余弦相似度计算。
代码附于此:
public static float get_cosine_similarity(int[] sentence1_vector, int[] sentence2_vector)
{
int vector_length = sentence1_vector.Length;
int i = 0;
float numerator = 0, denominator = 0;
int temp1 = 0, temp2 = 0;
double square_root1 = 0, square_root2 = 0;
for (i = 0; i < vector_length; i++)
{
numerator += sentence1_vector[i] * sentence2_vector[i];
temp1 += sentence1_vector[i] * sentence1_vector[i];
temp2 += sentence2_vector[i] * sentence2_vector[i];
}
//TextWriter tw = new StreamWriter("E://testpdf/date2.txt");
square_root1 = Math.Sqrt(temp1);
square_root2 = Math.Sqrt(temp2);
denominator = (float)(square_root1 * square_root2);
if (denominator != 0){
return (float)(numerator / denominator);
//return (float)(numerator);
}
else{
return 0;
}
}
我检查了一个网站,在那里我可以指定2个字符串并找到它们之间的余弦相似度。该网站随附:
http://cs.uef.fi/~zhao/Link/Similarity_strings.html
function implementationCosin(){
var string1 = document.DPAform.str1.value;
var s1 = stringBlankCheck(string1);
var string2 = document.DPAform.str2.value;
var s2 = stringBlankCheck(string2);
if (s1.length < 1) {
alert("Please input the string1.");
return;
}
if (s2.length < 1) {
alert("Please input the string2.");
return;
}
document.DPAform.displayArea2.value = "";
var sDT = new Date();
// var begin = new Date().getTime();
var cosin_similarity_value = consinSimilarity(s1, s2);
document.DPAform.displayArea2.value += 'Cosin_Similarity(' + s1 + ',' + s2 + ')=' + cosin_similarity_value + '%\n';
var eDT = new Date();
var timediff = sDT.dateDiff("ms", eDT);
// var timediff = (new Date().getTime() - begin);
document.DPAform.displayArea2.value += "The total escaped time is: " + timediff + " (ms).\n";
}
即使2个句子的0%相似,我的代码也说它们之间有一些相似之处。