Question

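I recently came across the following interview question:

Given an input string and a dictionary of words, implement a method that breaks the input string into a space-separated string of dictionary words, of the kind a search engine might use for "Did you mean?" For example, an input of "applepie" should yield an output of "apple pie".

I can't seem to get an optimal solution as far as complexity goes. Does anyone have any suggestions on how to do this efficiently?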

Answer 1

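It looks like the question is exactly my interview problem, down to the example I used in the post at The Noisy Channel. Glad you liked the solution. I'm quite sure you can't beat the O(n^2) dynamic programming / memoization solution I describe for worst-case performance.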
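For readers who want to see the shape of such a solution, here is a minimal bottom-up sketch in Python. It is not the code from the blog post; the name word_break and the back-pointer table are illustrative choices, and the O(n^2) figure counts each set lookup as one step (slicing and hashing the substring cost extra).

```python
def word_break(text, words):
    """Return one segmentation of text into dictionary words, or None."""
    n = len(text)
    # back[i] is the start index of the last word in some segmentation of
    # text[:i], or None if text[:i] cannot be segmented at all.
    back = [None] * (n + 1)
    back[0] = 0  # the empty prefix is trivially segmented
    for end in range(1, n + 1):
        for start in range(end):
            if back[start] is not None and text[start:end] in words:
                back[end] = start
                break
    if back[n] is None:
        return None
    # Walk the back-pointers from the end to recover the words.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return " ".join(reversed(pieces))


if __name__ == "__main__":
    print(word_break("applepie", {"apple", "pie"}))
```

The two nested loops visit every (start, end) pair at most once, which is where the quadratic bound comes from.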

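You can do better in practice if your dictionary and input aren't pathological. For example, if you can identify in linear time the substrings of the input string that are in the dictionary (e.g., using a trie), and if the number of such substrings is constant, then the overall time will be linear. Of course, that's a lot of assumptions, but real data is often much nicer than a pathological worst case.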
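A rough sketch of the trie idea, under my own assumptions: the trie is a nested dict, the empty-string key END marks the end of a word, and the helper names are made up for illustration. Walking the trie from each reachable position finds all dictionary words starting there and stops as soon as no longer word can match.

```python
END = ""  # sentinel key; it cannot clash with a one-character letter key


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def words_starting_at(text, start, trie):
    """Yield the end index of every dictionary word that begins at start."""
    node = trie
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            return  # walked off the trie: no longer word can match
        if END in node:
            yield i + 1


def can_segment(text, words):
    """Return True if text can be split entirely into dictionary words."""
    trie = build_trie(words)
    reachable = [False] * (len(text) + 1)
    reachable[0] = True
    for start in range(len(text)):
        if reachable[start]:
            for end in words_starting_at(text, start, trie):
                reachable[end] = True
    return reachable[len(text)]


if __name__ == "__main__":
    print(can_segment("applepie", {"apple", "pie", "app"}))
```

The work done at each position is bounded by the longest dictionary word starting there, so the total is well below quadratic when matches are few and short.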

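There are also some fun variations of the problem that make it harder, such as enumerating all valid segmentations, outputting a best segmentation based on some definition of best, handling a dictionary too large to fit in memory, and handling inexact segmentations (e.g., correcting spelling mistakes). Feel free to comment on my blog or otherwise contact me to follow up.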
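To make the "best segmentation" variation concrete, here is a small sketch that assumes "best" simply means "fewest words". That definition is my assumption for illustration only; a real system would more likely score candidates with word frequencies. The function name fewest_words is hypothetical.

```python
from functools import lru_cache


def fewest_words(text, words):
    """Return a segmentation of text with the fewest words, or None."""

    @lru_cache(maxsize=None)
    def best(start):
        # Best segmentation of text[start:] as a tuple of words, or None.
        if start == len(text):
            return ()
        candidates = []
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in words:
                rest = best(end)
                if rest is not None:
                    candidates.append((text[start:end],) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return None if result is None else " ".join(result)


if __name__ == "__main__":
    vocabulary = {"car", "cargo", "go", "cult", "science"}
    print(fewest_words("cargocultscience", vocabulary))
```

Swapping the `key=len` criterion for a different scoring function is the only change needed to try another definition of best.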

Answer 2

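This link describes the problem as a perfect interview question and provides several ways to solve it. Essentially it involves recursive backtracking. At that level it has O(2^n) complexity. An efficient solution using memoization can bring this down to O(n^2).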
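To illustrate the difference, here is a sketch of both variants side by side. This is my own illustration, not code from the linked article, and the function names are made up. The two functions run the same recursion; the second caches the answer for each start index so that every suffix is solved at most once.

```python
from functools import lru_cache


def can_break_naive(text, words):
    """Plain recursive backtracking: try every prefix that is a word."""
    if not text:
        return True
    return any(
        text[:i] in words and can_break_naive(text[i:], words)
        for i in range(1, len(text) + 1)
    )


def can_break_memo(text, words):
    """The same recursion, with each suffix solved at most once."""

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(text):
            return True
        return any(
            text[start:end] in words and solve(end)
            for end in range(start + 1, len(text) + 1)
        )

    return solve(0)


if __name__ == "__main__":
    vocabulary = {"a", "aa", "aaa"}
    tricky = "a" * 15 + "b"  # many ways to split the a's, none of which works
    print(can_break_naive(tricky, vocabulary), can_break_memo(tricky, vocabulary))
```

On an input like the one in the demo, the naive version retries the same suffixes over and over, while the memoized version examines each (start, end) pair once.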

Answer 3

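Using Python, we can write two functions. The first, segment, returns the first segmentation of a piece of contiguous text into words from a given dictionary, or None if no such segmentation is found. The other function, segment_all, returns a list of all segmentations found. The worst-case complexity of segment is O(n ** 2) memoized lookups, where n is the length of the input string in characters; segment_all can take longer, because the number of segmentations it has to build can itself grow exponentially.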

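The solution presented here can be extended to include spelling corrections and bigram analysis to determine the most likely segmentation.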

def memo(func):
    '''
    Applies simple memoization to a function
    '''
    cache = {}
    def closure(*args):
        if args in cache:
            v = cache[args]
        else:
            v = func(*args)
            cache[args] = v
        return v
    return closure


def segment(text, words):
    '''
    Return the first match that is the segmentation of 'text' into words
    '''
    @memo
    def _segment(text):
        if text in words: return text
        for i in range(1, len(text)):
            prefix, suffix = text[:i], text[i:]
            segmented_suffix = _segment(suffix)
            if prefix in words and segmented_suffix:
                return '%s %s' % (prefix, segmented_suffix)
        return None
    return _segment(text)


def segment_all(text, words):
    '''
    Return a full list of matches that are the segmentation of 'text' into words
    '''
    @memo
    def _segment(text):
        matches = []
        if text in words: 
            matches.append(text)
        for i in range(1, len(text)):
            prefix, suffix = text[:i], text[i:]
            segmented_suffix_matches = _segment(suffix)
            if prefix in words and len(segmented_suffix_matches):
                for match in segmented_suffix_matches:
                    matches.append('%s %s' % (prefix, match))
        return matches 
    return _segment(text)


if __name__ == "__main__":
    string = 'cargocultscience'
    words = set('car cargo go cult science'.split())
    print(segment(string, words))
    # >>> car go cult science
    print(segment_all(string, words))
    # >>> ['car go cult science', 'cargo cult science']

Answer 4

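One option would be to store all valid English words in a trie. Once you've done this, you can start walking the trie from the root downward, following the letters in the string. Whenever you find a node that's marked as a word, you have two options: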

Break the input at this point, or
Continue extending the word.

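You can claim that you've found a match once you have broken the input into a set of words that are all legal, with no characters left over. Since at each letter you have either one forced option (you are building a word that isn't valid and should stop, or you can only keep extending the word) or two options (split or keep going), you could implement this function using exhaustive recursion: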

PartitionWords(lettersLeft, wordSoFar, wordBreaks, trieNode):
    // If you walked off the trie, this path fails.
    if trieNode is null, return.

    // If this trie node is a word, consider what happens if you split
    // the word here.
    if trieNode.isWord:
        // If there is no input left, you're done and have a partition.
        if lettersLeft is empty, output wordBreaks + wordSoFar and return

        // Otherwise, try splitting here.
        PartitionWords(lettersLeft, "", wordBreaks + wordSoFar, trie root)

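    // If the input ran out in the middle of a word, this path fails.
    if lettersLeft is empty, return.

    // Otherwise, consume the next letter and continue:
    PartitionWords(lettersLeft.substring(1), wordSoFar + lettersLeft[0],
                   wordBreaks, trieNode.child[lettersLeft[0]])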

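In the pathological worst case this will list all partitions of the string, of which there can be exponentially many. However, that only happens if the string can be partitioned in a huge number of ways that all start with valid English words, which is unlikely in practice. If the string does have many partitions, though, we might spend a lot of time finding them. For example, consider the string "dotheredo". We can split it in many ways: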

do the redo
do the red o
doth ere do
dot here do
dot he red o
dot he redo

To avoid this, you might want to put a limit on the number of answers you report, perhaps two or three.
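One way to impose such a limit, sketched here under my own assumptions, is to generate segmentations lazily and stop after the first few with itertools.islice. For brevity this sketch checks membership in a plain set instead of walking a trie, and the function names are hypothetical.

```python
from itertools import islice


def iter_segmentations(text, words, start=0):
    """Lazily yield each segmentation of text[start:] as a list of words."""
    if start == len(text):
        yield []
        return
    for end in range(start + 1, len(text) + 1):
        word = text[start:end]
        if word in words:
            for rest in iter_segmentations(text, words, end):
                yield [word] + rest


def first_segmentations(text, words, limit=3):
    """Return at most limit segmentations as space-separated strings."""
    return [" ".join(s) for s in islice(iter_segmentations(text, words), limit)]


if __name__ == "__main__":
    vocabulary = {"do", "the", "redo", "red", "o", "doth", "ere", "dot", "here", "he"}
    print(first_segmentations("dotheredo", vocabulary, limit=3))
```

The limit only bounds how many answers are reported; dead ends explored before the first answers are found still cost time.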

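Since we cut off the recursion as soon as we walk off the trie, if we ever try a split that doesn't leave the remainder of the string valid, we will detect it pretty quickly.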

Hope this helps!

Answer 5

import java.util.*;

// A saved search state: an index into the test string plus which candidate
// word (among those sharing that first letter) to try next.
class Position {
    int indexTest, no;

    Position(int indexTest, int no) {
        this.indexTest = indexTest;
        this.no = no;
    }
}

class RandomWordCombo {
    static boolean isCombo(String[] dict, String test) {
        // Group the dictionary words by their first character.
        HashMap<String, ArrayList<String>> dic = new HashMap<String, ArrayList<String>>();
        Stack<Position> pos = new Stack<Position>();
        for (String each : dict) {
            if (each.isEmpty())
                continue; // an empty word has no first character
            String key = "" + each.charAt(0);
            if (!dic.containsKey(key))
                dic.put(key, new ArrayList<String>());
            dic.get(key).add(each);
        }
        // Iterative backtracking over (index, candidate number) pairs.
        pos.push(new Position(0, 0));
        while (!pos.isEmpty()) {
            Position position = pos.pop();
            if (position.indexTest >= test.length())
                continue; // nothing left to match (e.g. empty input)
            String key = "" + test.charAt(position.indexTest);
            if (dic.containsKey(key)) {
                ArrayList<String> strings = dic.get(key);
                // Remember to try the next candidate if this one fails.
                if (position.no < strings.size() - 1)
                    pos.push(new Position(position.indexTest, position.no + 1));
                String str = strings.get(position.no);
                // The whole word must match here, not only its first letter.
                if (!test.startsWith(str, position.indexTest))
                    continue;
                if (position.indexTest + str.length() == test.length())
                    return true;
                pos.push(new Position(position.indexTest + str.length(), 0));
            }
        }
        return false;
    }

    public static void main(String[] st) {
        String[] dic = {"world", "hello", "super", "hell"};
        System.out.println("is 'hellworld' a combo: " + isCombo(dic, "hellworld"));
    }
}

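I have done a similar problem. This solution returns true or false depending on whether the given string is a combination of dictionary words. It can easily be converted to produce a space-separated string. Its average complexity is O(n), where n is the number of dictionary words in the given string.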
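As a rough idea of what that conversion might look like, here is a Python sketch of the same stack-based search that carries the chosen words along with each position. It is an illustration under my own assumptions, not a port of the Java code; segment_with_stack is a made-up name.

```python
def segment_with_stack(text, dictionary):
    """Return one space-separated segmentation of text, or None."""
    # Group the words by first letter, as the Java version does.
    by_first = {}
    for word in dictionary:
        if word:
            by_first.setdefault(word[0], []).append(word)

    # Each stack entry is (index reached so far, words chosen so far).
    stack = [(0, [])]
    while stack:
        index, chosen = stack.pop()
        if index == len(text):
            return " ".join(chosen)
        for word in by_first.get(text[index], []):
            # The whole word must match at this index, not only its first letter.
            if text.startswith(word, index):
                stack.append((index + len(word), chosen + [word]))
    return None


if __name__ == "__main__":
    print(segment_with_stack("hellworld", ["world", "hello", "super", "hell"]))
```

Returning the joined words instead of True is the only real change; the search order and the backtracking stay the same.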
