I have two multi-line strings. I use the following code to determine the similarity between the two. It uses the Levenshtein distance algorithm.
public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) {
        longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}

public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();
    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0)
                costs[j] = j;
            else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1))
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0)
            costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
}
But the above code does not work as expected.
For example, suppose we have the following two strings, s1 and s2:
S1 -> How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?
S2 -> How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?
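I then pass the above strings to the similarity method, but it does not find the exact percentage of difference. How can I optimize the algorithm?
Here is my main method.
Update: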
public static boolean authQuestion(String question) throws SQLException{
    boolean isQuestionAvailable = false;
    Connection dbCon = null;
    try {
        dbCon = MyResource.getConnection();
        String query = "SELECT * FROM WORDBANK where WORD ~* ?;";
        PreparedStatement checkStmt = dbCon.prepareStatement(query);
        checkStmt.setString(1, question);
        ResultSet rs = checkStmt.executeQuery();
        while (rs.next()) {
            double re=similarity( rs.getString("question"), question);
            if(re > 0.6){
                isQuestionAvailable = true;
            }else {
                isQuestionAvailable = false;
            }
        }
    } catch (URISyntaxException e1) {
        e1.printStackTrace();
    } catch (SQLException sqle) {
        sqle.printStackTrace();
    } catch (Exception e) {
        if (dbCon != null)
            dbCon.close();
    } finally {
        if (dbCon != null)
            dbCon.close();
    }
    return isQuestionAvailable;
}
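Answer 0 (score: 5)
I can suggest an approach...
You are using edit distance, which gives you the number of characters in S1 that need to be changed, added, or removed to transform it into S2.
So, for example: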
S1 = "abc"
S2 = "cde"
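The edit distance is 3, and the strings are 100% different (assuming you compare them character by character).
You can get an approximate percentage if you do something like this: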
S1 = "abc"
S2 = "cde"
edit = edit_distance(S1, S2)
percentage = min(edit/S1.length(), edit/S2.length())
The min is a workaround to handle cases where the strings are very different, for example:
S1 = "abc"
S2 = "defghijklmno"
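Here the edit distance (12) is greater than the length of S1 (3), so dividing by the length of S1 would give more than 100%. Dividing by the larger of the two lengths is therefore probably the better choice.
Hope that helps.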
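The idea above can be sketched in Java as follows. This is only a minimal illustration, not the asker's code: the class name `DifferenceRatio` and the method `difference` are made up for the example, and the distance is a plain two-row Levenshtein implementation. It divides the edit distance by the longer length, so the result always stays between 0 and 1.

```java
final class DifferenceRatio {

    // Plain two-row Levenshtein distance: insert, delete and substitute each cost 1.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: fraction of the longer string that has to be edited.
    // 0.0 means identical, 1.0 means entirely different.
    static double difference(String s1, String s2) {
        int longer = Math.max(s1.length(), s2.length());
        if (longer == 0) {
            return 0.0; // two empty strings are identical
        }
        // Cast to double first; int / int would truncate to 0 or 1.
        return editDistance(s1, s2) / (double) longer;
    }

    public static void main(String[] args) {
        System.out.println(difference("abc", "cde"));          // 1.0
        System.out.println(difference("abc", "defghijklmno")); // 1.0
    }
}
```

Because the edit distance can never exceed the length of the longer string, dividing by that length caps the result at 100% without needing the min workaround.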
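Answer 1 (score: 3)
Your similarity method returns a number between 0 and 1 (inclusive), where 1 means the strings are identical (the edit distance is zero).
However, in your authQuestion method, you act as if it returns a number between 0 and 100, as evidenced by this line: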
if(re > 60){
You need to change it to
if(re > .6){
or
if(re * 100 > 60){
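To make the scale issue concrete, here is a small self-contained sketch. It assumes a similarity function with the same contract as the one in the question (a value in the range 0 to 1, case-insensitive); the class name `ThresholdScale` and the helper `isSimilar` are hypothetical, and the distance routine is a generic two-row Levenshtein rather than the asker's exact code.

```java
final class ThresholdScale {

    // Plain two-row Levenshtein distance: insert, delete and substitute each cost 1.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Same contract as the question's method: 1.0 = identical, 0.0 = nothing in common.
    static double similarity(String s1, String s2) {
        int longer = Math.max(s1.length(), s2.length());
        if (longer == 0) {
            return 1.0;
        }
        int distance = editDistance(s1.toLowerCase(), s2.toLowerCase());
        return (longer - distance) / (double) longer;
    }

    // Hypothetical helper: the threshold must be on the same 0..1 scale as similarity().
    static boolean isSimilar(String s1, String s2, double threshold) {
        return similarity(s1, s2) > threshold;
    }

    public static void main(String[] args) {
        String s1 = "How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?";
        String s2 = "How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?";
        double re = similarity(s1, s2);
        System.out.println(re);                     // a value just below 1.0
        System.out.println(re > 60);                // false: the value can never exceed 1.0
        System.out.println(isSimilar(s1, s2, 0.6)); // true
    }
}
```

A comparison against 60 can never succeed for any input, because the method's maximum return value is 1.0.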
Answer 2 (score: 1)
Since you are using your entire S1 in the where clause of your SQL query, it will either find a perfect match or return no result at all.
As mentioned by @ErwinBolwidt, if it returns nothing then your isQuestionAvailable will always remain false.
And if it returns a perfect match then you are bound to get 100% similarity.
What you can do is use a substring of your S1 to search for questions that match that part.
You can make the following change in the authQuestion method:
checkStmt.setString(1, question.substring(0, Math.min(20, question.length()))); // 20 is an arbitrary prefix length
Out of the results fetched, you can compare each result with your question for similarity.
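A rough sketch of that last step is shown below. It is an illustration under assumptions, not the asker's code: a plain `List<String>` stands in for the rows read from the `ResultSet`, and `CandidateMatcher`, `prefix` and `anySimilar` are hypothetical names. The loop returns as soon as one candidate passes the threshold, so a later non-matching row cannot reset the result.

```java
import java.util.Arrays;
import java.util.List;

final class CandidateMatcher {

    // Plain two-row Levenshtein distance: insert, delete and substitute each cost 1.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity on a 0..1 scale, case-insensitive, as in the question.
    static double similarity(String s1, String s2) {
        int longer = Math.max(s1.length(), s2.length());
        if (longer == 0) {
            return 1.0;
        }
        return (longer - editDistance(s1.toLowerCase(), s2.toLowerCase())) / (double) longer;
    }

    // Hypothetical helper: a search prefix that never runs past the end of the text.
    static String prefix(String text, int maxLength) {
        return text.substring(0, Math.min(maxLength, text.length()));
    }

    // Hypothetical helper: true if at least one candidate is similar enough.
    static boolean anySimilar(List<String> candidates, String question, double threshold) {
        for (String candidate : candidates) {
            if (similarity(candidate, question) > threshold) {
                return true; // stop at the first hit so later rows cannot overwrite it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String question = "How do we optimize the performance?";
        // Stand-in for the rows the prefix query would return.
        List<String> rows = Arrays.asList("How do we optimize tje performance?", "How do we optimize the build?");
        System.out.println(prefix(question, 20));            // How do we optimize t
        System.out.println(anySimilar(rows, question, 0.6)); // true
    }
}
```

In the real method the candidates would come from iterating the `ResultSet`, with the prefix passed as the query parameter.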