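Suppose I have two strings like these:
Query 1: "Ideas of March"
Query 2: "Ceaser died in March"
function(j) = |Query1 intersection Query2| / |Query1 union Query2|
I only care about which tokens (words) the two strings share, regardless of their position.
Query1 intersection Query2 = 1 {March}
Query1 union Query2 = 6 {Ideas, of, March, Ceaser, died, in}
In this context, function(j) should return 1/6.
Is there any way to get the intersection count and the union count of two sentences? For example, here: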
    public double calculateSimilarity(String oneContent, String otherContent)
    {
        // intersection(...) and union(...) are the helpers I am looking for;
        // each is assumed to return a Set of tokens.
        Set<String> numerator = intersection(oneContent, otherContent);
        Set<String> denominator = union(oneContent, otherContent);
        return denominator.size() > 0 ?
            (double) numerator.size() / (double) denominator.size() : 0;
    }
Are there any functions available in Java that give me the intersection count and the union count without using an external library such as Google Guava?
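To make the snippet above concrete, here is a minimal sketch of what the two helpers might look like using only `java.util`. The class name `TokenSets` and the `tokens` helper are hypothetical, and the tokenization (whitespace splitting, case-sensitive, no punctuation handling) is an assumption based on the example, not a requirement.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; the names are illustrative and not part of any library.
final class TokenSets {

    // Assumption: tokens are whitespace-separated and compared case-sensitively.
    static Set<String> tokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new HashSet<>(); // "".split(...) would yield one empty token
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Tokens that occur in both strings (Set.retainAll keeps only shared elements).
    static Set<String> intersection(String a, String b) {
        Set<String> result = tokens(a);
        result.retainAll(tokens(b));
        return result;
    }

    // Tokens that occur in either string (Set.addAll ignores duplicates).
    static Set<String> union(String a, String b) {
        Set<String> result = tokens(a);
        result.addAll(tokens(b));
        return result;
    }

    static double calculateSimilarity(String a, String b) {
        Set<String> all = union(a, b);
        return all.isEmpty() ? 0 : (double) intersection(a, b).size() / all.size();
    }

    public static void main(String[] args) {
        String q1 = "Ideas of March";
        String q2 = "Ceaser died in March";
        System.out.println(intersection(q1, q2));        // [March]
        System.out.println(union(q1, q2).size());        // 6
        System.out.println(calculateSimilarity(q1, q2)); // 1/6 as a double
    }
}
```

This builds both sets in full, which is simple but does more work than strictly needed when only the two counts matter.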
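Answer 0 (score: 0)
Since you are only interested in the sizes of the union and the intersection, you can compute both without actually building either set: union(a, b).size() is just a.size() + b.size() - intersection(a, b).size(), so only the intersection size is needed.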
    public static void main(String[] args) {
        final String a = "Ideas of March";
        final String b = "Ceaser died in March";
        final java.util.regex.Pattern p
            = java.util.regex.Pattern.compile("\\s+");
        final double similarity = similarity(
            p.splitAsStream(a).collect(java.util.stream.Collectors.toSet()),
            p.splitAsStream(b).collect(java.util.stream.Collectors.toSet()));
        assert similarity == 1d / 6;
        System.out.println(similarity); // 0.1666...
    }
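    // Requires: import java.util.Set;
    // emptyJaccardSimilarityCoefficient and parallelSimilarity are not shown in this answer.
    public static double similarity(Set<?> left, Set<?> right) {
        final int sa = left.size();
        final int sb = right.size();
        // At least one set is empty: the result is 0 unless both are empty.
        if ((sa - 1 | sb - 1) < 0)
            return (sa | sb) == 0 ? emptyJaccardSimilarityCoefficient : 0;
        // Both sizes are Integer.MAX_VALUE, so the sets may hold even more elements.
        if ((sa + 1 & sb + 1) < 0)
            return parallelSimilarity(left, right);
        // Iterate over the smaller set and probe the larger one.
        final Set<?> smaller = sa <= sb ? left : right;
        final Set<?> larger = sa <= sb ? right : left;
        int intersection = 0;
        for (final Object element : smaller) try {
            if (larger.contains(element))
                intersection++;
        } catch (final ClassCastException | NullPointerException e) {
            // contains() may reject incompatible or null elements; count them as not shared.
        }
        // size() saturates at Integer.MAX_VALUE, so fall back to counting in that case.
        final long sum = (sa + 1 > 0 ? sa : left.stream().count())
            + (sb + 1 > 0 ? sb : right.stream().count());
        // |union| = |left| + |right| - |intersection|
        return 1d / (sum - intersection) * intersection;
    }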
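For readers who do not need the edge-case handling for extremely large sets, the same idea can be sketched in a self-contained form. This is a simplified variant under stated assumptions, not the code above: the class name `JaccardCount` is hypothetical, and returning 0 for two empty sets is an assumed convention standing in for the undefined `emptyJaccardSimilarityCoefficient`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class name; a simplified sketch that skips the overflow handling.
final class JaccardCount {

    // Assumed convention: two empty sets have similarity 0.
    static final double EMPTY_SIMILARITY = 0.0;

    static double similarity(Set<?> left, Set<?> right) {
        if (left.isEmpty() && right.isEmpty()) {
            return EMPTY_SIMILARITY;
        }
        // Iterate over the smaller set so fewer contains() lookups are needed.
        Set<?> smaller = left.size() <= right.size() ? left : right;
        Set<?> larger = smaller == left ? right : left;
        long intersection = 0;
        for (Object element : smaller) {
            if (larger.contains(element)) {
                intersection++;
            }
        }
        // Inclusion-exclusion: |A union B| = |A| + |B| - |A intersection B|.
        long union = (long) left.size() + right.size() - intersection;
        return (double) intersection / union;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("Ideas of March".split("\\s+")));
        Set<String> b = new HashSet<>(Arrays.asList("Ceaser died in March".split("\\s+")));
        System.out.println(similarity(a, b)); // 1 shared token out of 6 distinct tokens
    }
}
```

Neither the union nor the intersection set is materialized here; only a counter is kept, which is the point of the approach described in this answer.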
Answer 1 (score: -1)
You can use Apache Commons Text (https://commons.apache.org/proper/commons-text/), which has no external dependencies other than Apache Commons Lang.
You can find its Jaccard coefficient implementation here: https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/JaccardDistance.java