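I am trying to use a Java class to measure the cosine similarity between two documents of different lengths. The class that performs the calculation is as follows: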
import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;

public class CosineSimilarityy {

    public Double calculateCosineSimilarity(HashMap<String, Double> firstFeatures, HashMap<String, Double> secondFeatures) {
        Double similarity = 0.0;
        Double sum = 0.0;   // the numerator of the cosine similarity (dot product)
        Double fnorm = 0.0; // the first part of the denominator of the cosine similarity
        Double snorm = 0.0; // the second part of the denominator of the cosine similarity
        Set<String> fkeys = firstFeatures.keySet();
        Iterator<String> fit = fkeys.iterator();
        while (fit.hasNext()) {
            String featurename = fit.next();
            // only features present in both maps contribute to the dot product
            boolean containKey = secondFeatures.containsKey(featurename);
            if (containKey) {
                sum = sum + firstFeatures.get(featurename) * secondFeatures.get(featurename);
            }
        }
        fnorm = calculateNorm(firstFeatures);
        snorm = calculateNorm(secondFeatures);
        similarity = sum / (fnorm * snorm);
        return similarity;
    }

    /**
     * Calculate the norm of one feature vector.
     *
     * @param feature the feature vector of one cluster
     * @return the Euclidean (L2) norm of the vector
     */
    public Double calculateNorm(HashMap<String, Double> feature) {
        Double norm = 0.0;
        Set<String> keys = feature.keySet();
        Iterator<String> it = keys.iterator();
        while (it.hasNext()) {
            String featurename = it.next();
            norm = norm + Math.pow(feature.get(featurename), 2);
        }
        return Math.sqrt(norm);
    }
}
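Then I constructed an instance of this class, created two HashMaps, and put one document into each of them. When I run the calculation, the result is 1.0 if the documents are identical, which is correct, but if there is even a slight difference between them, the result is always zero. What am I missing?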
public static void main(String[] args) {
    // TODO code application logic here
    CosineSimilarityy test = new CosineSimilarityy();
    HashMap<String, Double> hash = new HashMap<>();
    HashMap<String, Double> hash2 = new HashMap<>();
    hash.put("i am a book", 1.0);
    hash2.put("you are a book", 2.0);
    double result;
    result = test.calculateCosineSimilarity(hash, hash2);
    System.out.println(" this is the result: " + result);
}
The original code was taken from here.
Answer 0 (score: 2)
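First of all, I think "i am a book" is being treated as a single feature. Each map then holds exactly one key, and because the two keys differ, the maps share no features and the numerator is zero. To compare the documents, you must first split each string into words, using whitespace as the delimiter. Next, populate the hash maps with the words extracted from each string, one entry per word. Then you can test whether the algorithm works correctly.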
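To make the arithmetic concrete, here is a worked version of both cases. The word-level weights are an assumption for illustration: each word is given a weight equal to its count in the string, which is 1 for every word here.

```latex
\cos(A, B) = \frac{\sum_k A_k B_k}{\sqrt{\sum_k A_k^2}\,\sqrt{\sum_k B_k^2}}

% Whole strings as features: A = {"i am a book": 1.0}, B = {"you are a book": 2.0}.
% No key is shared, so the numerator is empty:
\cos(A, B) = \frac{0}{1.0 \times 2.0} = 0

% Words as features (assumed weight 1 per word): the shared keys are "a" and "book".
\cos(A, B) = \frac{1 \cdot 1 + 1 \cdot 1}{\sqrt{4}\,\sqrt{4}} = \frac{2}{4} = 0.5
```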
How do i split a string with any whitespace chars as delimiters?
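The splitting step suggested in the answer could be sketched roughly as follows. This is only an illustration under some assumptions: the class name `TokenCosineDemo` and the helpers `termFrequencies`, `cosine`, and `norm` are hypothetical names that do not appear in the question, and the weight of each word is assumed to be its raw count in the string. The cosine computation is repeated here only so that the example compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

public class TokenCosineDemo {

    // Build a feature map with one entry per whitespace-separated word;
    // the value is how many times the word occurs (an assumed weighting).
    public static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> features = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                features.merge(word, 1.0, Double::sum);
            }
        }
        return features;
    }

    // Euclidean norm of a feature vector.
    public static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Dot product over shared keys divided by the product of the norms.
    // Returns 0.0 when either vector is empty, to avoid dividing by zero.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double denominator = norm(a) * norm(b);
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    public static void main(String[] args) {
        Map<String, Double> first = termFrequencies("i am a book");
        Map<String, Double> second = termFrequencies("you are a book");
        // The two strings share "a" and "book", so this prints: similarity: 0.5
        System.out.println("similarity: " + cosine(first, second));
    }
}
```

If the maps are declared as `HashMap<String, Double>` instead of `Map<String, Double>`, they can be passed directly to the `calculateCosineSimilarity` method from the question.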