I want to classify two strings as similar or not similar. For example:
s1 = "Token is invalid. DeviceId = deviceId: "345" "
s2 = "Token is invalid. DeviceId = deviceId: "123" "
s3 = "Could not send Message."
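Here s1 and s2 should count as similar, while s3 is not similar to either. I am looking for a Java library that gives a match score between two strings, from which I can decide whether they are similar. My program only needs to handle a small data set (~2000 strings). Do you know of anything available?
Answer 0 (score: 4)
Look at the Levenshtein distance to get a match score: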
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Java
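Answer 1 (score: 1)
For any NLP question in Java, the Apache Lucene project is worth checking. For your needs, though, a simple Levenshtein distance algorithm should be enough.
Answer 2 (score: 0)
As suggested, the Levenshtein distance algorithm: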
public class LevenshteinDistance
{
    private static int minimum(int a, int b, int c)
    {
        return Math.min(Math.min(a, b), c);
    }

    // Returns the minimum number of single-character insertions, deletions
    // and substitutions needed to turn str1 into str2.
    public static int computeLevenshteinDistance(CharSequence str1, CharSequence str2)
    {
        // distance[i][j] = edit distance between the first i chars of str1
        // and the first j chars of str2
        int[][] distance = new int[str1.length() + 1][str2.length() + 1];

        // Base cases: turning a prefix into the empty string (and vice versa)
        for (int i = 0; i <= str1.length(); i++)
            distance[i][0] = i;
        for (int j = 1; j <= str2.length(); j++)
            distance[0][j] = j;

        for (int i = 1; i <= str1.length(); i++)
            for (int j = 1; j <= str2.length(); j++)
                distance[i][j] = minimum(
                        distance[i - 1][j] + 1,      // deletion
                        distance[i][j - 1] + 1,      // insertion
                        distance[i - 1][j - 1]       // match or substitution
                                + ((str1.charAt(i - 1) == str2.charAt(j - 1)) ? 0 : 1));

        return distance[str1.length()][str2.length()];
    }

    public static void main(String[] args)
    {
        String s1 = "Token is invalid. DeviceId = deviceId: \"345\" ";
        String s2 = "Token is invalid. DeviceId = deviceId: \"123\" ";
        String s3 = "Could not send Message.";
        // A smaller distance means the strings are more alike
        System.out.println(computeLevenshteinDistance(s1, s2)); // s1 vs. s2
        System.out.println(computeLevenshteinDistance(s1, s3)); // s1 vs. s3
        System.out.println(computeLevenshteinDistance(s2, s3)); // s2 vs. s3
    }
}
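A raw edit distance depends on string length, so it does not say "similar or not" by itself. One possible way to get there is to scale the distance into a score between 0 and 1 and compare it to a cutoff. The sketch below is only an illustration: the class name SimilarityGrouping, its method names, and the 0.8 threshold are all assumptions made for this example, and the greedy grouping is just one simple option for a small data set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; class name, method names and threshold are illustrative assumptions.
class SimilarityGrouping {

    // Levenshtein distance, keeping only two rows of the DP table.
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the last finished row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Scales the distance to [0, 1]: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(CharSequence a, CharSequence b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }

    // Greedy grouping: each message joins the first group whose first member
    // is similar enough, otherwise it starts a new group.
    static List<List<String>> group(List<String> messages, double threshold) {
        List<List<String>> groups = new ArrayList<>();
        for (String message : messages) {
            List<String> target = null;
            for (List<String> candidate : groups) {
                if (similarity(candidate.get(0), message) >= threshold) {
                    target = candidate;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                groups.add(target);
            }
            target.add(message);
        }
        return groups;
    }

    public static void main(String[] args) {
        String s1 = "Token is invalid. DeviceId = deviceId: \"345\" ";
        String s2 = "Token is invalid. DeviceId = deviceId: \"123\" ";
        String s3 = "Could not send Message.";
        double threshold = 0.8; // assumed cutoff, tune it on real data
        System.out.println(similarity(s1, s2));
        System.out.println(similarity(s1, s3));
        System.out.println(group(Arrays.asList(s1, s2, s3), threshold));
    }
}
```

The threshold is a guess and should be tuned against real messages. The greedy grouping also depends on input order, so treat its output as a rough first pass rather than a proper clustering.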