Question

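So, I've written a spell checker in Java and it works as it should. The only problem is that if I use a word for which the maximum allowed edit distance is too large (say, 9), my code runs out of memory. I've profiled my code and dumped the heap to a file, but I don't know how to use it to optimize my code.

Can anyone offer any help? I'm more than willing to put up the file or use any other approach people might have.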

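-Edit-

Many people asked for more details in the comments. I figured other people would find them useful, and they might get buried in the comments, so here they are: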

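I am using a Trie to store the words.
To be time efficient, I don't compute the Levenshtein distance (LD) upfront; I calculate it as I go. By that I mean I keep only two rows of the LD table in memory. Since a Trie is a prefix tree, every time I recurse down a node, the previous letters of the word (and hence the distances for those letters) stay the same. So I only compute the distances with the new letter included, and the previous row stays unchanged.
The suggestions I generate are stored in a HashSet (the `suggestions` parameter in the code below). The rows of the LD table are stored in ArrayLists.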

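Here is the code of the function in the Trie that is causing the problem. Building the Trie is pretty straightforward, so I haven't included that code here.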

/*
 * @param letter: the letter that is currently being looked at in the trie
 *        word: the word that we are trying to find matches for
 *        previousRow: the previous row of the Levenshtein Distance table
 *        suggestions: all the suggestions for the given word
 *        maxd: max distance a word can be from the query and still be returned as suggestion
 *        suggestion: the current suggestion being constructed
 */


public void get(char letter, ArrayList<Character> word, ArrayList<Integer> previousRow, HashSet<String> suggestions, int maxd, String suggestion){

    // the new row of the Levenshtein Distance table that is to be computed.
    ArrayList<Integer> currentRow = new ArrayList<Integer>(word.size()+1);
    currentRow.add(previousRow.get(0)+1);

    int insert = 0;
    int delete = 0;
    int swap = 0;
    int d = 0;

    for(int i=1;i<word.size()+1;i++){
        delete = currentRow.get(i-1)+1;
        insert = previousRow.get(i)+1;

        if(word.get(i-1)==letter)
            swap = previousRow.get(i-1);
        else
            swap = previousRow.get(i-1)+1;

        d = Math.min(delete, Math.min(insert, swap));
        currentRow.add(d);
    }

    // if this node represents a word and the distance so far is <= maxd, then add this word as a suggestion
    if(isWord==true && d<=maxd){
        suggestions.add(suggestion);
    }

    // if any of the entries in the current row are <=maxd, it means we can still find possible solutions.
    // recursively search all the branches of the trie
    for(int i=0;i<currentRow.size();i++){
        if(currentRow.get(i)<=maxd){
            for(int j=0;j<26;j++){
                if(children[j]!=null){
                    children[j].get((char)(j+97), word, currentRow, suggestions, maxd, suggestion+String.valueOf((char)(j+97)));
                }
            }
            break;
        }
    }
}

Answer 1

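Here's some code I quickly put together showing one way to generate the candidates and then "rank" them.

The trick is: you never "test" an invalid candidate.

To me, your "I run out of memory when I have an edit distance of 9" screams "combinatorial explosion".

Of course, to dodge the combinatorial explosion, you don't do things like trying to generate all the words that are at a distance of 9 from your misspelled word. You start from the misspelled word and generate (quite a lot of) possible candidates, but you refrain from creating too many of them, because then you'd run into trouble.

(Also note that it doesn't make much sense to compute up to a Levenshtein edit distance (LED) of 9, because technically any word of fewer than 10 letters can be transformed into any other word of fewer than 10 letters in at most 9 transformations.)

Here's why you simply cannot test all the words up to a distance of 9 without either getting an OutOfMemory error or ending up with a program that never terminates: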

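Generating all the words at an LED of at most 1 from "ptmizing", by only adding one letter (from a to z), yields 9 * 26 variations (i.e. 234 variations) [there are 9 positions where you can insert one of the 26 letters].
Generating all the words at an LED of at most 2, again only adding letters, we now have 10 * 26 * 234 variations (60,840).
Generating all the words at an LED of at most 3: 11 * 26 * 60,840 = 17,400,240 variations.

And that is only considering the cases where we add one, two, or three letters (we're not counting deletions, swaps, etc.). And that's on a misspelled word that is only eight characters long. On "real" words, it explodes even faster.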
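The arithmetic above can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names are made up for this example, and it counts generated strings with duplicates included, exactly as the back-of-the-envelope estimate does.

```java
// Illustrative sketch: counts the strings produced by repeated single-letter
// insertions (duplicates included). Names here are hypothetical.
final class InsertionExplosion {

    static long variantsAfterInsertions(int wordLength, int insertions) {
        long count = 1;          // start from the misspelled word itself
        int length = wordLength;
        for (int k = 0; k < insertions; k++) {
            // a word of n letters has n + 1 insertion positions, 26 letters each
            count *= (long) (length + 1) * 26;
            length++;
        }
        return count;
    }

    public static void main(String[] args) {
        int wordLength = "ptmizing".length(); // 8 letters, hence 9 insertion positions
        for (int k = 1; k <= 3; k++) {
            System.out.println(k + " insertion(s): " + variantsAfterInsertions(wordLength, k));
        }
        // prints 234, then 60840, then 17400240
    }
}
```

Each extra insertion multiplies the count by roughly 26 times the word length, which is why a maximum distance of 9 is out of reach for this kind of generation.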

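Sure, you could get "smart" and generate these in a way that avoids too many duplicates, but the point remains: it's a combinatorial explosion, and it blows up fast.

Anyway... here's an example. I'm simply passing the dictionary of valid words (which contains only four words in this case) to the corresponding method, to keep this short.

You'd obviously want to replace the call to the LED with your own LED implementation.

The double metaphone call is just an example: in a real spell checker, words that "sound alike" despite a larger LED should be considered "more correct" and hence usually be suggested first. For example, "optimizing" and a phonetic misspelling of it can be quite far apart from an LED point of view, but using double metaphone you should get "optimizing" as one of the first suggestions.

(Disclaimer: the following was put together in a few minutes; it doesn't take uppercase, non-English words, etc. into account. It's not a real spell checker, just an example.)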

    @Test
    public void spellCheck() {
        final String src = "misspeled";
        final Set<String> validWords = new HashSet<String>();
        validWords.add("boing");
        validWords.add("Yahoo!");
        validWords.add("misspelled");
        validWords.add("stackoverflow");
        final List<String> candidates = findNonSortedCandidates( src, validWords );
        final SortedMap<Integer, List<String>> res = computeLevenhsteinEditDistanceForEveryCandidate(candidates, src);
        // candidates are printed in order of increasing edit distance
        for ( final Map.Entry<Integer, List<String>> entry : res.entrySet() ) {
            for ( final String candidate : entry.getValue() ) {
                System.out.println( candidate + " @ LED: " + entry.getKey() );
            }
        }
    }

    // Groups candidates by distance, so that two candidates at the same
    // distance do not overwrite each other in the map.
    private SortedMap<Integer, List<String>> computeLevenhsteinEditDistanceForEveryCandidate(
            final List<String> candidates,
            final String mispelledWord
    ) {
        final SortedMap<Integer, List<String>> res = new TreeMap<Integer, List<String>>();
        for ( final String candidate : candidates ) {
            final int distance = dynamicProgrammingLED(candidate, mispelledWord);
            List<String> sameDistance = res.get(distance);
            if ( sameDistance == null ) {
                sameDistance = new ArrayList<String>();
                res.put(distance, sameDistance);
            }
            sameDistance.add(candidate);
        }
        return res;
    }

    // Placeholder: Levenhstein stands for whatever edit-distance implementation you use.
    private int dynamicProgrammingLED( final String candidate, final String misspelledWord ) {
        return Levenhstein.getLevenshteinDistance(candidate,misspelledWord);
    }
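The answer leaves the edit-distance implementation behind `dynamicProgrammingLED` up to the reader. As a rough sketch of what could sit there, here is a standard two-row dynamic-programming Levenshtein distance. The class name `LevenshteinSketch` and its `distance` method are assumptions made for this example, not part of the original answer.

```java
// Hypothetical stand-in for the edit-distance call used above:
// classic Levenshtein distance keeping only two rows of the table.
final class LevenshteinSketch {

    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // building the first j letters of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // reducing the first i letters of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int substitution = previous[j - 1] + cost;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // the row just computed becomes the previous row for the next letter
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("misspeled", "misspelled")); // prints 1
    }
}
```

With something like this in place, `dynamicProgrammingLED` could simply return `LevenshteinSketch.distance(candidate, misspelledWord)`.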

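Here you can generate all the possible candidates using several methods. I've only implemented one such method (and quickly, so it may well be bogus, but that's not the point ;)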

    private List<String> findNonSortedCandidates( final String src, final Set<String> validWords ) {
        final List<String> res = new ArrayList<String>();
        res.addAll( allCombinationAddingOneLetter(src, validWords) );
//        res.addAll( allCombinationRemovingOneLetter(src) );
//        res.addAll( allCombinationInvertingLetters(src) );
        return res;
    }

    private List<String> allCombinationAddingOneLetter( final String src, final Set<String> validWords ) {
        final List<String> res = new ArrayList<String>();
        for (char c = 'a'; c <= 'z'; c++) {
            for (int i = 0; i < src.length(); i++) {
                final String candidate = src.substring(0, i) + c + src.substring(i, src.length());
                if ( validWords.contains(candidate) ) {
                    res.add(candidate); // only adding candidates we know are valid words
                }
            }
            if ( validWords.contains(src+c) ) {
                res.add( src + c );
            }
        }
        return res;
    }
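The two commented-out generators in `findNonSortedCandidates` are not shown in the answer. A possible sketch follows, under two assumptions that go beyond the original: "inverting letters" is read as swapping two adjacent letters, and both methods take the `validWords` set so they can apply the same "only keep valid words" filter. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical counterparts of the commented-out generators; not the author's code.
final class MoreCandidates {

    // Drops each letter in turn and keeps the results that are valid words.
    static List<String> removingOneLetter(String src, Set<String> validWords) {
        List<String> res = new ArrayList<String>();
        for (int i = 0; i < src.length(); i++) {
            String candidate = src.substring(0, i) + src.substring(i + 1);
            if (validWords.contains(candidate) && !res.contains(candidate)) {
                res.add(candidate); // skip duplicates, e.g. from doubled letters
            }
        }
        return res;
    }

    // Swaps each pair of adjacent letters and keeps the results that are valid words.
    static List<String> swappingAdjacentLetters(String src, Set<String> validWords) {
        List<String> res = new ArrayList<String>();
        for (int i = 0; i + 1 < src.length(); i++) {
            char[] chars = src.toCharArray();
            char tmp = chars[i];
            chars[i] = chars[i + 1];
            chars[i + 1] = tmp;
            String candidate = new String(chars);
            if (validWords.contains(candidate) && !res.contains(candidate)) {
                res.add(candidate);
            }
        }
        return res;
    }
}
```

As with the insertion method, candidates are checked against the dictionary before being kept, so the result lists stay small no matter how many strings are tried.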

Answer 2

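One thing you could try is increasing Java's heap size to get past the "out of memory" error.

The following article will help you understand how to increase the heap size in Java:

http://viralpatel.net/blogs/2009/01/jvm-java-increase-heap-size-setting-heap-size-jvm-heap.html

But I think the better way to tackle your problem is to find a better algorithm than the current one.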
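If you do try the heap-size route, it can help to confirm that the setting actually took effect. A minimal sketch, assuming the maximum heap is set with the standard `-Xmx` JVM option; the class name `HeapSizeCheck` is made up for this example:

```java
// Prints the maximum heap the JVM is willing to use.
// Run with, for example: java -Xmx1g HeapSizeCheck
final class HeapSizeCheck {

    static long maxHeapMegabytes() {
        // maxMemory() reports the upper bound on the heap, in bytes
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

A larger heap only postpones the failure if the number of generated objects grows combinatorially, which is why changing the algorithm is the more durable fix.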

Answer 3

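Without more information on the topic, there isn't much the community can do for you... You could start with the following:

Look at what your profiler says (after it has been running for a little while): does anything pile up? Are there a lot of objects of one kind? That will usually give you a hint about what is wrong with your code.
Publish your saved dump somewhere and link it in your question, so that someone else can take a look at it.
Tell us which profiler you are using, so that somebody can give you hints on where to look for valuable information.
Once you have narrowed the problem down to a specific part of your code and still cannot figure out why there are so many objects of $FOO in memory, post a snippet of the relevant part.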
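As a rough complement to a profiler, heap usage can also be logged from inside the program, for example before and after a search, to see whether memory keeps piling up. The sketch below uses the standard `java.lang.management` API; the class name `HeapUsageLog` and the placement of the calls are assumptions for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Logs how much heap is currently in use; call it around the suspect code.
final class HeapUsageLog {

    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        // ... run the code under suspicion here, e.g. one spell-check query ...
        long after = usedHeapBytes();
        System.out.println("Heap used before: " + before + " bytes, after: " + after + " bytes");
    }
}
```

These numbers are coarse, since the garbage collector may run at any time, so they indicate a trend rather than replace a look at the heap dump.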

Tips for optimizing Java code
