apache.commons.text cosine distance

Asked: 2017-02-21 08:22:09

Tags: java apache-commons cosine-similarity

I am trying to use the cosine distance class from Apache Commons Text, but it always returns 1.0. What am I missing? Here is my code:

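// Package name as of Commons Text 1.0.
import org.apache.commons.text.similarity.CosineDistance;

public class ComputeDistance {
    public static void main(String[] args) throws Exception {

        CosineDistance dist = new CosineDistance();
        CharSequence c1 = "example text1";
        CharSequence c2 = "another file";
        // Prints 1.0 for these two inputs.
        System.out.println(dist.apply(c1, c2));
    }
}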

1 answer:

Answer 0 (score: 1)

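CosineDistance returns 1 - cosineSimilarity(leftVector, rightVector), where leftVector and rightVector map each word of a char sequence to its number of occurrences. Your two strings have no words in common, so cosineSimilarity(leftVector, rightVector) = 0 and the distance is 1.0. You can change your code to compare the characters of your char sequences instead of the words: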

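import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Package name as of Commons Text 1.0.
import org.apache.commons.text.similarity.CosineSimilarity;

public class ComputeDistance {
  public static void main(String[] args) throws Exception {

    CosineSimilarity similarity = new CosineSimilarity();

    String c1 = "example text1";
    String c2 = "another file";

    // split("") yields one-character strings; count how often each occurs.
    Map<CharSequence, Integer> leftVector =
        Arrays.stream(c1.split(""))
        .collect(Collectors.toMap(c -> c, c -> 1, Integer::sum));
    Map<CharSequence, Integer> rightVector =
        Arrays.stream(c2.split(""))
        .collect(Collectors.toMap(c -> c, c -> 1, Integer::sum));

    // Distance is 1 minus the similarity of the two character-count vectors.
    System.out.println(1 - similarity.cosineSimilarity(leftVector, rightVector));

  }
}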
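
To see why the word-level result is exactly 1.0, here is a minimal library-free sketch that computes the cosine distance by hand for both tokenizations. It is not the Commons Text implementation. The class name CosineSketch and its helper methods are hypothetical, and the sketch assumes a word is a run of \w characters and that an empty vector gives distance 1.0.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration; not part of Commons Text.
public class CosineSketch {

    // Counts word tokens, assuming a word is a run of \w characters.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    // Counts single characters (spaces included), as split("") produces them.
    static Map<String, Integer> charCounts(String text) {
        return Arrays.stream(text.split(""))
                .collect(Collectors.toMap(c -> c, c -> 1, Integer::sum));
    }

    // Cosine distance = 1 - dot(a, b) / (|a| * |b|).
    static double cosineDistance(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            // Only tokens present in both maps contribute to the dot product.
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double normA = Math.sqrt(a.values().stream().mapToDouble(v -> v * v).sum());
        double normB = Math.sqrt(b.values().stream().mapToDouble(v -> v * v).sum());
        if (normA == 0 || normB == 0) {
            return 1.0; // assumption of this sketch: empty input is maximally distant
        }
        return 1.0 - dot / (normA * normB);
    }

    public static void main(String[] args) {
        String c1 = "example text1";
        String c2 = "another file";
        // Word level: no shared tokens, so the dot product is 0.
        System.out.println(cosineDistance(wordCounts(c1), wordCounts(c2)));
        // Character level: shared letters give a non-zero dot product.
        System.out.println(cosineDistance(charCounts(c1), charCounts(c2)));
    }
}
```

With the two strings from the question, the word-level call has a dot product of 0 and therefore prints 1.0, while the character-level call shares the characters e, a, l, t and the space, so its distance comes out below 1.0 (roughly 0.39).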