Calculating cosine similarity

Asked: 2015-03-21 14:21:31

Tags: java hashmap cosine-similarity

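I am trying to use a Java class to measure the cosine similarity between two documents of different lengths. The class that performs the calculation is as follows: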

import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;

public class CosineSimilarityy {
    public Double calculateCosineSimilarity(HashMap<String, Double> firstFeatures, HashMap<String, Double> secondFeatures) {
        Double similarity = 0.0;
        Double sum = 0.0; // the numerator of the cosine similarity
        Double fnorm = 0.0; // the first part of the denominator of the cosine similarity
        Double snorm = 0.0; // the second part of the denominator of the cosine similarity
        Set<String> fkeys = firstFeatures.keySet();
        Iterator<String> fit = fkeys.iterator();
        while (fit.hasNext()) {
            String featurename = fit.next();
            boolean containKey = secondFeatures.containsKey(featurename);
            if (containKey) {
                sum = sum + firstFeatures.get(featurename) * secondFeatures.get(featurename);
            }
        }
        fnorm = calculateNorm(firstFeatures);
        snorm = calculateNorm(secondFeatures);
        similarity = sum / (fnorm * snorm);
        return similarity;
    }

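    /**
     * Calculates the Euclidean norm of one feature vector.
     *
     * @param feature the feature vector (feature name mapped to its weight)
     * @return the square root of the sum of the squared weights
     */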
    public Double calculateNorm(HashMap<String, Double> feature) {
        Double norm = 0.0;
        Set<String> keys = feature.keySet();
        Iterator<String> it = keys.iterator();
        while (it.hasNext()) {
            String featurename = it.next();
            norm = norm + Math.pow(feature.get(featurename), 2);
        }
        return Math.sqrt(norm);
    }
}

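I then construct an instance of this class, create two HashMaps, and put one document into each of them. When I run the calculation on identical documents the result is 1.0, which is correct, but if there is even a slight difference between them, the result is always zero. What am I missing?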

public static void main(String[] args) {
    // TODO code application logic here

    CosineSimilarityy test = new CosineSimilarityy();
    HashMap<String, Double> hash = new HashMap<>();
    HashMap<String, Double> hash2 = new HashMap<>();
    hash.put("i am a book", 1.0);
    hash2.put("you are a book", 2.0);
    double result;
    result = test.calculateCosineSimilarity(hash, hash2);
    System.out.println(" this is the result: " + result);
}

The original code was taken from here.

1 answer:

Answer 0 (score: 2)

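First, I think "i am a book" is being treated as a single feature. Each map then holds exactly one key, and because the two keys differ, no feature is shared and the numerator is always zero. To make a meaningful comparison, you must first split the strings being compared, using whitespace as the delimiter. Next, populate the hash maps with the individual words extracted from each string. Then you can test whether the algorithm works correctly.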
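As a minimal sketch of that suggestion, the following splits each string on whitespace before building the maps. The class name TokenizedCosineDemo and the helper termFrequencies are hypothetical, and lowercasing the text and using raw word counts as weights are assumptions made for illustration, not something taken from the question.

```java
import java.util.HashMap;
import java.util.Map;

class TokenizedCosineDemo {

    // Hypothetical helper: split on runs of whitespace and count each word.
    // Lowercasing and raw counts as weights are assumptions for this sketch.
    static HashMap<String, Double> termFrequencies(String text) {
        HashMap<String, Double> counts = new HashMap<>();
        for (String word : text.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1.0, Double::sum);
            }
        }
        return counts;
    }

    // Euclidean norm of a sparse vector stored as a map.
    static double norm(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double weight : vector.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine similarity: dot product over shared keys divided by the norms.
    // Returns 0.0 when either vector is empty, to avoid dividing by zero.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double denominator = norm(a) * norm(b);
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    public static void main(String[] args) {
        Map<String, Double> first = termFrequencies("i am a book");
        Map<String, Double> second = termFrequencies("you are a book");
        System.out.println(cosine(first, second)); // prints 0.5
    }
}
```

With the two strings from the question, the word vectors share "a" and "book", so the dot product is 2 and each norm is 2, giving 2 / (2 * 2) = 0.5 instead of 0.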

How do i split a string with any whitespace chars as delimiters?

Cosine similarity (Wikipedia)