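How do I find the percentage of similarity between two multi-line strings?

Time: 2017-01-03 05:44:32

Tags: java algorithm levenshtein-distance

I have two multi-line strings. I use the code below to determine the similarity between the two. It uses the Levenshtein distance algorithm.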

  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { 
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }

    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

But the code above does not work as expected.

For example, suppose we have the following two strings, s1 and s2:

S1 -> How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?

S2 -> How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?

I then pass these strings to the similarity method, but it does not return the exact percentage of difference. How can I optimize the algorithm?

Here is my main method.

Update:

public static boolean authQuestion(String question) throws SQLException{


        boolean isQuestionAvailable = false;
        Connection dbCon = null;
        try {
            dbCon = MyResource.getConnection();
            String query = "SELECT * FROM WORDBANK where WORD ~*  ?;";
            PreparedStatement checkStmt = dbCon.prepareStatement(query);
            checkStmt.setString(1, question);
            ResultSet rs = checkStmt.executeQuery();
            while (rs.next()) {
                double re=similarity( rs.getString("question"), question);
                if(re  > 0.6){
                    isQuestionAvailable = true;
                }else {
                    isQuestionAvailable = false;
                }
            }
        } catch (URISyntaxException e1) {
            e1.printStackTrace();
        } catch (SQLException sqle) {
            sqle.printStackTrace();
        } catch (Exception e) {
            if (dbCon != null)
                dbCon.close();
        } finally {
            if (dbCon != null)
                dbCon.close();
        }

        return isQuestionAvailable;
    }

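3 Answers:

Answer 0 (score: 5)

I can suggest one approach...

You are using edit distance, which gives you the number of characters in S1 that need to be changed/added/removed to turn it into S2.

So, for example: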

S1 = "abc"
S2 = "cde"

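The edit distance is 3, and the strings are 100% different (given that you are comparing them character by character).

If you do something like this, you can get an approximate percentage: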

S1 = "abc"
S2 = "cde"
edit = edit_distance(S1, S2)
percentage = min(edit/S1.length(), edit/S2.length())

The min is a workaround for cases where the strings are very different, for example:

S1 = "abc"
S2 = "defghijklmno"

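Here the edit distance is greater than the length of S1, so dividing by that length would give a percentage above 100%. Dividing by the larger length is therefore probably the better choice.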
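A minimal Java sketch of this normalization, assuming a plain case-sensitive Levenshtein distance (the question's version lowercases its input first). The class name `DifferencePercent` and the method `differencePercent` are hypothetical and not taken from the question. Since min(edit/len1, edit/len2) is the same as edit divided by the longer length, the sketch divides by the longer length directly, which keeps the result between 0 and 100.

```java
// Sketch only: class and method names are hypothetical, not from the question.
class DifferencePercent {

    // Two-row Levenshtein distance (case-sensitive).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev; // swap rows so prev always holds the last completed row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Difference as a percentage of the longer string's length, so it stays in [0, 100].
    static double differencePercent(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0; // two empty strings are identical
        }
        return 100.0 * editDistance(a, b) / longer;
    }

    public static void main(String[] args) {
        System.out.println(differencePercent("abc", "cde"));          // 100.0
        System.out.println(differencePercent("abc", "defghijklmno")); // 100.0
    }
}
```

Dividing the second pair by the length of "abc" instead would give 12 / 3 = 400%, which is the case the answer is guarding against.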

Hope that helps.

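Answer 1 (score: 3)

Your similarity method returns a number between 0 and 1 (inclusive), where 1 means the strings are identical (the edit distance is zero).

However, in your authQuestion method you act as if it returned a number between 0 and 100, as this line shows: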

if(re > 60){

You need to change it to

if(re > .6){

or

if(re * 100 > 60){
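To make the scale mismatch concrete, here is a small Java sketch. The class and method names are hypothetical, and the sample value is what the question's formula yields for two three-character strings that differ by one substitution, such as "abc" and "abd": (3 - 1) / 3.0.

```java
// Sketch only: the class, the method names, and the sample value are illustrative.
class ThresholdScale {

    // Treats a [0, 1] score as if it were a percentage, so it can never succeed.
    static boolean wrongCheck(double re) {
        return re > 60;
    }

    // Compares on the [0, 1] scale.
    static boolean fractionCheck(double re) {
        return re > 0.6;
    }

    // Converts to a percentage first, then compares.
    static boolean percentCheck(double re) {
        return re * 100 > 60;
    }

    public static void main(String[] args) {
        double re = (3 - 1) / 3.0; // about 0.667

        System.out.println(wrongCheck(re));    // false
        System.out.println(fractionCheck(re)); // true
        System.out.println(percentCheck(re));  // true
    }
}
```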

Answer 2 (score: 1)

Since you are using your entire S1 in the WHERE clause of your SQL query, it will either find a perfect match or return no result at all.

As mentioned by @ErwinBolwidt, if it returns nothing then your isQuestionAvailable will always remain false. And if it returns a perfect match, then you are bound to get 100% similarity.

What you can do is: Use a substring of your S1 to search for questions that match that part.

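You can make the following changes:

In the authQuestion method:

checkStmt.setString(1, question.substring(0, Math.min(20, question.length()))); // say, the first 20 characters; Math.min avoids an exception on shorter questions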

Out of the results fetched, you can compare each result with your question for similarity.
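A hedged Java sketch of that last step, kept separate from the database code. All names here (`CandidateFilter`, `searchPrefix`, `anySimilar`) are hypothetical, and the similarity measure is passed in as a parameter so that the question's own `similarity` method could be supplied as a method reference. The loop returns as soon as one candidate passes, so a later non-matching row cannot overwrite an earlier match, which is what happens in the question's while loop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Sketch only: class, method, and parameter names are hypothetical.
class CandidateFilter {

    // First maxLength characters of the question, or the whole question if it is shorter.
    static String searchPrefix(String question, int maxLength) {
        return question.substring(0, Math.min(maxLength, question.length()));
    }

    // True as soon as one candidate scores above the threshold (scores assumed to be in [0, 1]).
    static boolean anySimilar(List<String> candidates, String question, double threshold,
                              ToDoubleBiFunction<String, String> similarity) {
        for (String candidate : candidates) {
            if (similarity.applyAsDouble(candidate, question) > threshold) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in measure for this demo only: 1.0 for a case-insensitive exact match, else 0.0.
        // In the question's code, its own similarity method would be passed instead.
        ToDoubleBiFunction<String, String> standIn =
                (a, b) -> a.equalsIgnoreCase(b) ? 1.0 : 0.0;

        List<String> fetched = Arrays.asList("How do we test?", "how do we optimize?");
        System.out.println(searchPrefix("How do we optimize the performance?", 20));
        System.out.println(anySimilar(fetched, "How do we optimize?", 0.6, standIn));
    }
}
```

The strings collected from the ResultSet would take the place of `fetched`, and 0.6 is the threshold already used in the question.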