Question

I have the following working Java code for searching for a word in a list of words. It works perfectly and as expected:

public class Levenshtein {
    private int[][] wordMartix;

    public Set<String> similarExists(String searchWord) {

        int maxDistance = searchWord.length();
        int curDistance;
        int sumCurMax;
        String checkWord;

        // a Set prevents duplicate words in the returned list
        Set<String> fuzzyWordList = new HashSet<>();

        for (Object wordList : Searcher.wordList) {
            checkWord = String.valueOf(wordList);
            curDistance = calculateDistance(searchWord, checkWord);
            sumCurMax = maxDistance + curDistance;
            if (sumCurMax == checkWord.length()) {
                fuzzyWordList.add(checkWord);
            }
        }
        return fuzzyWordList;
    }

    public int calculateDistance(String inputWord, String checkWord) {
        wordMartix = new int[inputWord.length() + 1][checkWord.length() + 1];

        for (int i = 0; i <= inputWord.length(); i++) {
            wordMartix[i][0] = i;
        }

        for (int j = 0; j <= checkWord.length(); j++) {
            wordMartix[0][j] = j;
        }

        for (int i = 1; i < wordMartix.length; i++) {
            for (int j = 1; j < wordMartix[i].length; j++) {
                if (inputWord.charAt(i - 1) == checkWord.charAt(j - 1)) {
                    wordMartix[i][j] = wordMartix[i - 1][j - 1];
                } else {
                    int minimum = Integer.MAX_VALUE;
                    if ((wordMartix[i - 1][j]) + 1 < minimum) {
                        minimum = (wordMartix[i - 1][j]) + 1;
                    }

                    if ((wordMartix[i][j - 1]) + 1 < minimum) {
                        minimum = (wordMartix[i][j - 1]) + 1;
                    }

                    if ((wordMartix[i - 1][j - 1]) + 1 < minimum) {
                        minimum = (wordMartix[i - 1][j - 1]) + 1;
                    }

                    wordMartix[i][j] = minimum;
                }
            }
        }

        return wordMartix[inputWord.length()][checkWord.length()];
    }

}

Now, when I search for a word such as job, it returns a list:

Output

joborienterede
jobannoncer
jobfunktioner
perjacobsen
jakobsen
jobprofiler
jacob
jobtitler
jobbet
jobdatabaserne
jobfunktion
jakob
jobs
studenterjobber
johannesburg
jobmuligheder
jobannoncerne
jobbaser
job
joberfaringer

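As you can see, the output contains a lot of related words, but also unrelated ones such as jakob, jacob, etc. This is correct according to the Levenshtein formula, but I would like to build further and write a method that can fine-tune my search so that I get more relevant and related words.

I have worked on it for a few hours and have run out of creativity.

My question: Is it possible to fine-tune the existing method to return relevant/related words, or should I take another approach, or ...? Whether the answer is yes or no, I would appreciate input and inspiration on improving the search results.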
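
A note on why jakob and jacob pass the filter: the check `maxDistance + curDistance == checkWord.length()` holds only when the Levenshtein distance equals the length difference of the two words. That happens exactly when the transformation needs insertions only, i.e. when searchWord is a subsequence of checkWord. The sketch below illustrates that equivalent test; the class and method names are hypothetical and are not part of the code above.

```java
class SubsequenceFilterDemo {

    // True when every character of `needle` appears in `word` in the same order
    // (not necessarily next to each other).
    static boolean isSubsequence(String needle, String word) {
        int matched = 0;
        for (int i = 0; i < word.length() && matched < needle.length(); i++) {
            if (word.charAt(i) == needle.charAt(matched)) {
                matched++;
            }
        }
        return matched == needle.length();
    }

    public static void main(String[] args) {
        String[] words = {"jobbet", "jakob", "jacob", "johannesburg", "bob"};
        for (String word : words) {
            System.out.println(word + " -> " + isSubsequence("job", word));
        }
    }
}
```

Since j, o and b occur in that order in jakob and jacob, they are accepted just like jobbet, while bob is rejected.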

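Update

A long time after asking this question I still have not really found a solution, and I am coming back to it because I now need a useful answer. Answers with Java code samples are fine, but what matters most is a detailed answer that describes the available methods and approaches for indexing the best and most relevant search results while ignoring irrelevant words. I know this is an open and endless area, but I need some inspiration to get started somewhere.

Note: The oldest answer right now is based on one of the comments and is not helpful. It just sorts by distance, which does not mean better search results or quality.

So I did the distance sorting, and the result was this: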

job
jobs
jacob
jakob
jobbet
jakobsen
jobbaser
jobtitler
jobannoncer
jobfunktion
jobprofiler
perjacobsen
johannesburg
jobannoncerne
joberfaringer
jobfunktioner
jobmuligheder
jobdatabaserne
joborienterede
studenterjobber

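So the word jobbaser is relevant and jacob/jakob are not, but the distance for jobbaser is greater than for jacob/jakob. So that did not really help.

General feedback regarding the answers

@SergioMontoro, it almost solves the problem.
@uSeemSurprised, it solves the problem but needs continual manipulation.
@Gene, the concept is excellent, but it relies on an external URL.

Thanks: I would personally like to thank everyone who contributed to this question. I have received good answers and useful comments.

Special thanks to @SergioMontoro, @uSeemSurprised and @Gene for their answers, which are different but valid and useful.

@D.Kovács pointed out some interesting solutions.

I wish I could give the bounty to all of these answers. Choosing one answer and giving it the bounty does not mean the other answers are invalid. It only means that the particular answer I chose was useful to me.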
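
To make the numbers concrete, here is a minimal sketch (hypothetical class name, plain textbook Levenshtein) that prints the distance from job to a few of the words above. Because job is a subsequence of every word in the list, each distance is simply the length difference, so sorting by distance is the same as sorting by word length.

```java
class DistanceOrderingDemo {

    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // swap rows so that `prev` always holds the most recent row
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String query = "job";
        for (String word : new String[] {"jobs", "jacob", "jakob", "jobbaser"}) {
            System.out.println(word + ": distance " + levenshtein(query, word)
                    + ", length difference " + (word.length() - query.length()));
        }
    }
}
```

The distance is 2 for jacob and jakob but 5 for jobbaser, which is why the relevant word ends up below the irrelevant ones.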

Answer 1

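Without understanding the meaning of the words, as @DrYap suggests, the next logical unit for comparing two words (if you are not looking for misspellings) is the syllable. It is very easy to modify Levenshtein to compare syllables instead of characters. The hard part is breaking the words into syllables. There is a Java implementation, TeXHyphenator-J, which can be used to split the words. Based on this hyphenation library, here is a modified version of the Levenshtein function written by Michael Gilleland & Chas Emerick. More about syllable detection here and here. Of course, you will want to avoid a syllable comparison of two single-syllable words, probably by handling that case with standard Levenshtein.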

import net.davidashen.text.Hyphenator;

public class WordDistance {

    public static void main(String args[]) throws Exception {
        Hyphenator h = new Hyphenator();
        h.loadTable(WordDistance.class.getResourceAsStream("hyphen.tex"));
        // print the result; the original call discarded the return value
        System.out.println(getSyllableLevenshteinDistance(h, args[0], args[1]));
    }

    /**
     * <p>
     * Calculate Syllable Levenshtein distance between two words </p>
     * The Syllable Levenshtein distance is defined as the minimal number of
     * case-insensitive syllables you have to replace, insert or delete to transform s into t.
     * @return int
     * @throws NullPointerException if either s or t is <b>null</b>
     */
    public static int getSyllableLevenshteinDistance(Hyphenator h, String s, String t) {
        if (s == null || t == null)
            throw new NullPointerException("Strings must not be null");

        final String hyphen = Character.toString((char) 173);
        final String[] ss = h.hyphenate(s).split(hyphen);
        final String[] st = h.hyphenate(t).split(hyphen);

        final int n = ss.length;
        final int m = st.length;

        if (n == 0)
            return m;
        else if (m == 0)
            return n;

        int p[] = new int[n + 1]; // 'previous' cost array, horizontally
        int d[] = new int[n + 1]; // cost array, horizontally

        for (int i = 0; i <= n; i++)
            p[i] = i;

        for (int j = 1; j <= m; j++) {
            d[0] = j;
            for (int i = 1; i <= n; i++) {
                int cost = ss[i - 1].equalsIgnoreCase(st[j - 1]) ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
            }
            // copy current distance counts to 'previous row' distance counts
            int[] _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now actually has the most recent cost counts
        return p[n];
    }

}

Answer 2

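You can modify the Levenshtein distance by adjusting the scoring when consecutive characters match.

Whenever there are consecutive characters that match, the score can be reduced, which makes the search more relevant.

Example: say the factor by which we want to reduce the score is 10. If we find the substring "job" in a word, we reduce the score by 10 when we encounter "j", reduce it further by 20 (10 + 20 in total) when we find "jo", and finally reduce it by 30 (10 + 20 + 30 in total) when we find "job".

I have written C++ code for this below: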

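#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

#define INF (-10000000)  // sentinel: memo entry not computed yet
#define FACTOR 10        // bonus multiplier for consecutive matches

using namespace std;

// memo[i][j][count]; words are assumed to be shorter than 100 characters
double memo[100][100][100];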

double Levenshtein(string inputWord, string checkWord, int i, int j, int count){
    if(i == inputWord.length() && j == checkWord.length()) return 0;    
    if(i == inputWord.length()) return checkWord.length() - j;
    if(j == checkWord.length()) return inputWord.length() - i;
    if(memo[i][j][count] != INF) return memo[i][j][count];

    double ans1 = 0, ans2 = 0, ans3 = 0, ans = 0;
    if(inputWord[i] == checkWord[j]){
        ans1 = Levenshtein(inputWord, checkWord, i+1, j+1, count+1) - (FACTOR*(count+1));
        ans2 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
        ans3 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
        ans = min(ans1, min(ans2, ans3));
    }else{
        ans1 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
        ans2 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
        ans = min(ans1, ans2);
    }
    return memo[i][j][count] = ans;
}

int main(void) {
    // read 40 candidate words from stdin and rank them against `word`
    string word = "job";
    string wordList[40];
    vector< pair <double, string> > ans;
    for(int i = 0;i < 40;i++){
        cin >> wordList[i];
        for(int j = 0;j < 100;j++) for(int k = 0;k < 100;k++){
            for(int m = 0;m < 100;m++) memo[j][k][m] = INF;
        }
        ans.push_back( make_pair(Levenshtein(word, wordList[i], 
            0, 0, 0), wordList[i]) );
    }
    sort(ans.begin(), ans.end());
    for(int i = 0;i < ans.size();i++){
        cout << ans[i].second << " " << ans[i].first << endl;
    }
    return 0;
}

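Demo link: http://ideone.com/4UtCX3

Here FACTOR is 10. You can experiment with other words and choose an appropriate value.

Also note that the complexity of the above Levenshtein distance has increased: it is now O(n^3) instead of O(n^2), because we also keep track of a counter for how many consecutive characters we have matched.

You can play with the score further by increasing the penalty gradually once a run of consecutive matches is followed by a mismatch, instead of the current approach where a fixed score of 1 is added to the overall score.

Also, in the above solution you can remove the strings that have a score >= 0, as they are not relevant at all. You can also choose some other threshold to get a more accurate search.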
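
Since the question is about Java, here is a rough Java sketch of the same scoring idea. It is a port under assumptions, not the code from the demo link: the class name, the Integer memo table and the small word list in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ConsecutiveMatchScore {
    static final int FACTOR = 10; // same bonus factor as in the C++ code above

    // Score of aligning a[i..] with b[j..] when `run` characters have just matched in a row.
    // Lower scores mean a better match.
    static int score(String a, String b, int i, int j, int run, Integer[][][] memo) {
        if (i == a.length()) return b.length() - j;
        if (j == b.length()) return a.length() - i;
        if (memo[i][j][run] != null) return memo[i][j][run];

        // Skipping a character on either side costs 1 and resets the run.
        int best = Math.min(score(a, b, i + 1, j, 0, memo),
                            score(a, b, i, j + 1, 0, memo)) + 1;
        if (a.charAt(i) == b.charAt(j)) {
            // A match extends the run and earns a bonus that grows with the run length.
            best = Math.min(best,
                    score(a, b, i + 1, j + 1, run + 1, memo) - FACTOR * (run + 1));
        }
        return memo[i][j][run] = best;
    }

    static int score(String a, String b) {
        int maxRun = Math.min(a.length(), b.length()) + 1;
        Integer[][][] memo = new Integer[a.length() + 1][b.length() + 1][maxRun];
        return score(a, b, 0, 0, 0, memo);
    }

    public static void main(String[] args) {
        // hypothetical sample taken from the word list in the question
        List<String> words = new ArrayList<>(
                Arrays.asList("jacob", "jobbaser", "jakob", "jobs", "job"));
        words.sort(Comparator.comparingInt((String w) -> score("job", w)));
        for (String w : words) {
            System.out.println(w + " " + score("job", w));
        }
    }
}
```

For the query job this gives -60 for job, -59 for jobs, -55 for jobbaser and -38 for both jacob and jakob, so jobbaser now ranks ahead of the two names.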

Answer 3

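Since you asked, I'll show how the UMBC semantic similarity service (the one called in the code below) does at this kind of thing. Not sure it's what you really want: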

import static java.lang.String.format;
import static java.util.stream.Collectors.toMap;
import static java.util.function.Function.identity;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.regex.Pattern;

public class SemanticSimilarity {
  private static final String GET_URL_FORMAT
      = "http://swoogle.umbc.edu/SimService/GetSimilarity?"
          + "operation=api&phrase1=%s&phrase2=%s";
  private static final Pattern VALID_WORD_PATTERN = Pattern.compile("\\w+");
  private static final String[] DICT = {
    "cat",
    "building",
    "girl",
    "ranch",
    "drawing",
    "wool",
    "gear",
    "question",
    "information",
    "tank" 
  };

  public static String httpGetLine(String urlToRead) throws IOException {
    URL url = new URL(urlToRead);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      return reader.readLine();
    }
  }

  public static double getSimilarity(String a, String b) {
    if (!VALID_WORD_PATTERN.matcher(a).matches()
        || !VALID_WORD_PATTERN.matcher(b).matches()) {
      throw new RuntimeException("Bad word");
    }
    try {
      return Double.parseDouble(httpGetLine(format(GET_URL_FORMAT, a, b)));
    } catch (IOException | NumberFormatException ex) {
      return -1.0;
    }
  }

  public static void test(String target) throws IOException {
    System.out.println("Target: " + target);
    Arrays.stream(DICT)
        .collect(toMap(identity(), word -> getSimilarity(target, word)))
        .entrySet().stream()
        .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
        .forEach(System.out::println);
    System.out.println();
  }

  public static void main(String[] args) throws Exception {
    test("sheep");
    test("vehicle");
    test("house");
    test("data");
    test("girlfriend");
  }
}

The results are kind of fascinating:

Target: sheep
ranch=0.38563728
cat=0.37816614
wool=0.36558008
question=0.047607
girl=0.0388761
information=0.027191084
drawing=0.0039623436
tank=0.0
building=0.0
gear=0.0

Target: vehicle
tank=0.65860236
gear=0.2673374
building=0.20197356
cat=0.06057514
information=0.041832563
ranch=0.017701812
question=0.017145569
girl=0.010708235
wool=0.0
drawing=0.0

Target: house
building=1.0
ranch=0.104496084
tank=0.103863
wool=0.059761923
girl=0.056549154
drawing=0.04310725
cat=0.0418914
gear=0.026439993
information=0.020329408
question=0.0012588014

Target: data
information=0.9924584
question=0.03476312
gear=0.029112043
wool=0.019744944
tank=0.014537057
drawing=0.013742204
ranch=0.0
cat=0.0
girl=0.0
building=0.0

Target: girlfriend
girl=0.70060706
ranch=0.11062875
cat=0.09766617
gear=0.04835723
information=0.02449007
wool=0.0
question=0.0
drawing=0.0
tank=0.0
building=0.0

Answer 4

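I tried the suggestion from the comments about sorting the matches by the distance returned by the Levenshtein algorithm, and it does seem to produce better results.

(As I could not find the Searcher class in your code, I took the liberty of using a different word list, Levenshtein implementation, and language.)

Using the word list provided in Ubuntu and the Levenshtein implementation from https://github.com/ztane/python-Levenshtein, I created a small script that asks for a word and prints all the closest words and their distances as tuples.

Code - https://gist.github.com/atdaemon/9f59ad886c35024bdd28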

from Levenshtein import distance
import os

def read_dict():
    # /usr/share/dict/words is the word list shipped with Ubuntu
    with open('/usr/share/dict/words', 'r') as f:
        for line in f:
            yield line.strip()

# Python 3: input() and print() replace raw_input() and the print statement
inp = input('Enter a word : ')

# keep every word within edit distance 2 of the input
matches = []
for word in read_dict():
    dist = distance(inp, word)
    if dist < 3:
        matches.append((dist, word))
print(os.linesep.join(map(str, sorted(matches))))

Sample output -

Enter a word : job
(0, 'job')
(1, 'Bob')
(1, 'Job')
(1, 'Rob')
(1, 'bob')
(1, 'cob')
(1, 'fob')
(1, 'gob')
(1, 'hob')
(1, 'jab')
(1, 'jib')
(1, 'jobs')
(1, 'jog')
(1, 'jot')
(1, 'joy')
(1, 'lob')
(1, 'mob')
(1, 'rob')
(1, 'sob')
...

Enter a word : checker
(0, 'checker')
(1, 'checked')
(1, 'checkers')
(2, 'Becker')
(2, 'Decker')
(2, 'cheaper')
(2, 'cheater')
(2, 'check')
(2, "check's")
(2, "checker's")
(2, 'checkered')
(2, 'checks')
(2, 'checkup')
(2, 'cheeked')
(2, 'cheekier')
(2, 'cheer')
(2, 'chewer')
(2, 'chewier')
(2, 'chicer')
(2, 'chicken')
(2, 'chocked')
(2, 'choker')
(2, 'chucked')
(2, 'cracker')
(2, 'hacker')
(2, 'heckler')
(2, 'shocker')
(2, 'thicker')
(2, 'wrecker')

Answer 5

This is indeed an open-ended question, but I would suggest an alternative approach that uses, for example, the Smith-Waterman algorithm as described in this SO.
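
As a rough illustration of what local alignment scoring looks like, here is a minimal Smith-Waterman sketch in Java. The scoring values (match +2, mismatch -1, linear gap -1) and the class name are assumptions chosen for the example, not values taken from the linked post.

```java
class SmithWatermanDemo {
    static final int MATCH = 2;      // assumed reward for equal characters
    static final int MISMATCH = -1;  // assumed penalty for a substitution
    static final int GAP = -1;       // assumed linear gap penalty

    // Best local alignment score between a and b (higher means more similar).
    static int localAlignmentScore(String a, String b) {
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = h[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                int up = h[i - 1][j] + GAP;
                int left = h[i][j - 1] + GAP;
                // a cell never drops below 0, so an alignment can start anywhere
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"job", "jobbaser", "studenterjobber", "jacob"}) {
            System.out.println(word + " " + localAlignmentScore("job", word));
        }
    }
}
```

With these assumed weights, any word that contains job as a contiguous substring reaches the maximum score of 6, while jacob only reaches 4, regardless of how long the word is.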

Another (more lightweight) solution would be to use other distance/similarity metrics from NLP (e.g., Cosine similarity or Damerau–Levenshtein distance).
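
For reference, a minimal sketch of one of these metrics: the optimal string alignment variant of Damerau–Levenshtein, which adds the swap of two adjacent characters as a single edit. The class name is hypothetical, and the restricted (optimal string alignment) variant is an assumption made to keep the example short. Note that this mainly helps with typos such as jbo for job; on its own it does not address the relevance problem from the question.

```java
class OsaDistanceDemo {

    // Optimal string alignment distance: Levenshtein plus transposition
    // of two adjacent characters counted as one edit.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // adjacent transposition, e.g. "bo" <-> "ob"
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("job", "jbo"));
        System.out.println(distance("job", "jobs"));
    }
}
```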

Improving search results using Levenshtein distance in Java
