How do I implement the Jaccard coefficient in Java?

Asked: 2017-04-14 13:26:42

Tags: java string function string-comparison

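Suppose I have two strings like these.

Query1: "Ideas of March"

Query2: "Ceaser died in March"

function(j) = (Query1 intersection Query2) / (Query1 union Query2)

I only care about the counts of matching tokens (words), regardless of their position.

Query1 intersection Query2 = 1 {March}

Query1 union Query2 = 6 {Ideas, of, March, Ceaser, died, in}

In this context, function(j) should return 1/6.

Is there any way I can find the intersection count and the union count of two sentences? For example, here: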

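public double calculateSimilarity(String oneContent, String otherContent)
{
    // intersection(...) and union(...) are the helpers being asked about;
    // they are assumed here to return a Set<String> of tokens.
    Set<String> numerator   = intersection(oneContent, otherContent);
    Set<String> denominator = union(oneContent, otherContent);

    return denominator.size() > 0 ?
    (double) numerator.size() / (double) denominator.size() : 0;
}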

Is there any function available in Java that gives me the intersection count and the union count without using an external library such as Google Guava?

2 Answers:

Answer 0: (score: 0)

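Since you are only interested in the sizes of the union and the intersection, you can compute both without actually building the union and intersection sets: union(a, b).size() = a.size() + b.size() - intersection(a, b).size(), so only the intersection size is required.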

public static void main(String[] args) {
    final String a = "Ideas of March";
    final String b = "Ceaser died in March";
    final java.util.regex.Pattern p
        = java.util.regex.Pattern.compile("\\s+");
    final double similarity = similarity(
            p.splitAsStream(a).collect(java.util.stream.Collectors.toSet()),
            p.splitAsStream(b).collect(java.util.stream.Collectors.toSet()));
    assert similarity == 1d / 6;
    System.out.println(similarity); // 0.1666...
}

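// Requires: import java.util.Set;
// Note: emptyJaccardSimilarityCoefficient and parallelSimilarity are not
// defined in this answer; they must be supplied by the caller's class.
public static double similarity(Set<?> left, Set<?> right) {
    final int sa = left.size();
    final int sb = right.size();
    // True when at least one set is empty (sa - 1 or sb - 1 is negative).
    if ((sa - 1 | sb - 1) < 0)
        return (sa | sb) == 0 ? emptyJaccardSimilarityCoefficient : 0;
    // True when both sizes are Integer.MAX_VALUE (sa + 1 and sb + 1 overflow).
    if ((sa + 1 & sb + 1) < 0)
        return parallelSimilarity(left, right);
    // Iterate over the smaller set and probe the larger one.
    final Set<?> smaller = sa <= sb ? left : right;
    final Set<?> larger  = sa <= sb ? right : left;
    int intersection = 0;
    for (final Object element : smaller) try {
        if (larger.contains(element))
            intersection++;
    } catch (final ClassCastException | NullPointerException e) {
        // contains() may reject incompatible or null elements; count as absent.
    }
    // size() saturates at Integer.MAX_VALUE, so count the elements in that case.
    final long sum = (sa + 1 > 0 ? sa : left.stream().count())
                   + (sb + 1 > 0 ? sb : right.stream().count());
    // |union| = |left| + |right| - |intersection|
    return 1d / (sum - intersection) * intersection;
}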
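
For readers who want something shorter, the same identity can be sketched with nothing but java.util collections. This is a minimal illustration, not the code from this answer: the class name JaccardSketch and the helpers tokens and jaccard are hypothetical, tokens are split on whitespace and compared case-sensitively, and two empty inputs are assumed to give 0 (as in the question's snippet).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class JaccardSketch {

    // Splits on whitespace; case-sensitive, punctuation is kept (an assumption).
    static Set<String> tokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    static double jaccard(String a, String b) {
        Set<String> left = tokens(a);
        Set<String> right = tokens(b);

        // Intersection: copy one set and keep only elements also in the other.
        Set<String> common = new HashSet<>(left);
        common.retainAll(right);
        int intersection = common.size();

        // |union| = |left| + |right| - |intersection|, no union set needed.
        int union = left.size() + right.size() - intersection;

        // Assumption: similarity of two empty inputs is defined as 0.
        return union == 0 ? 0.0 : (double) intersection / union;
    }

    public static void main(String[] args) {
        // One shared token ("March") out of six distinct tokens, i.e. 1/6.
        System.out.println(jaccard("Ideas of March", "Ceaser died in March"));
    }
}
```

HashSet.retainAll modifies the set it is called on, which is why the sketch works on a copy of left rather than on left itself.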

Answer 1: (score: -1)