Splitting the content of a single line

Time: 2015-12-15 10:14:22

Tags: algorithm dna-sequence

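I just ran into a problem where the input is a string written as a single word. The line is unreadable; for example, "I want to leave" is written as "Iwanttoleave".

The problem is separating out each token (words, numbers, abbreviations, etc.).

I don't know where to start.

The first idea that came to mind was to build a dictionary and then map accordingly, but I don't think building a dictionary is a good idea.

Can anyone suggest an algorithm?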

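2 answers:

Answer 0: (score: 1)

I suggest using a Trie loaded with all the valid words (the whole English dictionary?) instead of a dictionary. You can then move through the input line and the trie together, one letter at a time. If the letter leads to further matches in the trie, you can keep extending the current word; if not, you can start looking for a new word.

This won't be a forward-only search, so you need some kind of backtracking.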

// This method Generates a list with all the matching phrases for the given input
List<string> CandidatePhrases(string input) {
    Trie validWords = BuildTheTrieWithAllValidWords();
    List<string> currentWords = new List<string>();
    List<string> possiblePhrases = new List<string>();
    // The root of the trie has an empty key that points to all the first letters of all words
    Trie currentWord = validWords;
    int currentLetter = -1;
    // Calls a backtracking method that creates all possible phrases
    FindPossiblePhrases(input, validWords, currentWords, currentWord, currentLetter, possiblePhrases);

    return possiblePhrases;
}

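// The Trie structure could be something like
class Trie {
    public char key;            // letter stored at this node; '\0' at the root
    public bool valid;          // true if the path from the root to this node spells a word
    public List<Trie> children = new List<Trie>();
    public Trie parent;

    // Returns the child for nextLetter, or null if no word continues that way (needs System.Linq)
    public Trie Next(char nextLetter) {
        return children.FirstOrDefault(c => c.key == nextLetter);
    }

    // Rebuilds the word by walking back up to the root
    public string WholeWord() {
        Debug.Assert(valid);
        string word = "";
        Trie current = this;
        while (current.key != '\0')
        {
            word = current.key + word;
            current = current.parent;
        }
        return word;
    }
}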

void FindPossiblePhrases(string input, Trie validWords, List<string> currentWords, Trie currentWord, int currentLetter, List<string> possiblePhrases) {
    if (currentLetter == input.Length - 1) {
        if (currentWord.valid) {
            string phrase = "";
            foreach (string word in currentWords) {
                phrase += word;
                phrase += " ";
            }
            phrase += currentWord.WholeWord();
            possiblePhrases.Add(phrase);
        }
    }
    else {
        // The currentWord may be a valid word. If that's the case, the next letter could be the first of a new word, or could be the next letter of a bigger word that begins with currentWord
        if (currentWord.valid) {
            // Try to match phrases when the currentWord is a valid word
            currentWords.Add(currentWord.WholeWord());
            FindPossiblePhrases(input, validWords, currentWords, validWords, currentLetter, possiblePhrases);
            currentWords.RemoveAt(currentWords.Count - 1);
        }

        // If either the currentWord is a valid word, or not, try to match a longer word that begins with current word
        int nextLetter = currentLetter + 1;
        Trie nextWord = currentWord.Next(input[nextLetter]);
        // If the nextWord is null, there was no matching word that begins with currentWord and has input[nextLetter] as the following letter.
        if (nextWord != null) {
            FindPossiblePhrases(input, validWords, currentWords, nextWord, nextLetter, possiblePhrases);
        }
    }
}    
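
For comparison, here is a compact sketch of the same trie-plus-backtracking idea in Java. It is not the answer's code: the names TrieSegmenter, add, and segment are made up for illustration, and the word list is assumed to be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; a sketch, not a definitive implementation.
class TrieSegmenter {
    // One trie node: children keyed by letter, plus an end-of-word flag.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean valid;
    }

    private final Node root = new Node();

    // Insert one dictionary word into the trie.
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.valid = true;
    }

    // Return every way to split input into words stored in the trie.
    List<String> segment(String input) {
        List<String> phrases = new ArrayList<>();
        backtrack(input, 0, new ArrayList<>(), phrases);
        return phrases;
    }

    private void backtrack(String input, int start, List<String> words, List<String> phrases) {
        if (start == input.length()) {
            phrases.add(String.join(" ", words)); // the whole input was consumed
            return;
        }
        Node node = root;
        // Walk the trie and the input together, one letter at a time.
        for (int end = start; end < input.length(); end++) {
            node = node.children.get(input.charAt(end));
            if (node == null) {
                return; // no stored word continues with this letter
            }
            if (node.valid) {
                words.add(input.substring(start, end + 1));
                backtrack(input, end + 1, words, phrases);
                words.remove(words.size() - 1); // undo the choice and try a longer word
            }
        }
    }
}
```

With the words "pro", "gram", and "program" added, segment("program") returns both "pro gram" and "program", which is why some ranking step is still needed if a single answer is wanted.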

Answer 1: (score: 1)

First, create a dictionary that helps you determine whether a given string is a valid word.

boolean isValidString(String s){
    return dictionary.contains(s);   //dictionary is assumed to be a Set<String> of valid words
}

Now you can write recursive code that splits the string and builds a set of words that are actually useful.

ArrayList<String> usefulWords = new ArrayList<String>();      //global declaration
void split(String s){
    int l = s.length();
    for(int i = l-1; i >= 0; i--){
        if(isValidString(s.substring(i,l))){     //s.substring(i,l) returns the substring starting at index `i` and ending at `l-1`
            usefulWords.add(s.substring(i,l));
            split(s.substring(0,i));
        }
    }
}
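
To see what this step collects, here is a self-contained Java sketch of the same suffix-first recursion. The class name UsefulWordCollector and the constructor that takes the dictionary as a Set are assumptions made for the example, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical wrapper class; the recursion mirrors the split() idea above.
class UsefulWordCollector {
    private final Set<String> dictionary;
    final List<String> usefulWords = new ArrayList<>();

    UsefulWordCollector(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Try every suffix of s, shortest first; when one is a word,
    // record it and recurse on the part of s before that suffix.
    void split(String s) {
        int l = s.length();
        for (int i = l - 1; i >= 0; i--) {
            String suffix = s.substring(i, l);
            if (dictionary.contains(suffix)) {
                usefulWords.add(suffix);
                split(s.substring(0, i));
            }
        }
    }
}
```

With the dictionary {cat, dog, i, am, pro, gram, program, programmer, grammer} and the input "program", usefulWords ends up as [am, gram, pro, program]. Note that "am" is collected even though the remaining "progr" cannot be split, so the list is a set of candidates rather than a finished split.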

Now use these usefulWords to generate all possible strings. Perhaps something like this:

ArrayList<String>[] splits = new ArrayList[10];   //assuming max 10 possible outputs; each splits[k] must be set to an empty list before use
void allPossibleStrings(String s, int level){
    for(int i = 1; i <= s.length(); i++){          //prefix lengths 1..s.length(), so the whole string is tried as well
        if(usefulWords.contains(s.substring(0,i))){
            splits[level].add(s.substring(0,i));
            allPossibleStrings(s.substring(i),level);
            level++;
        }
    }
}

Now, this code gives you all the possible splits in a somewhat haphazard way. For example:

dictionary = {cat, dog, i, am, pro, gram, program, programmer, grammer}

input:
string = program
output:
splits[0] = {pro, gram}
splits[1] = {program}

input:
string = iamprogram
output:
splits[0] = {i, am, pro, gram}   //since `mer` is not in dictionary
splits[1] = {program}

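I haven't thought the last part through, but I think you should be able to work out the code from there according to your requirements.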
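
One possible way to finish that last part is plain backtracking that carries the words chosen so far, so that every result is a complete split. This is a sketch under assumptions: DictionarySplitter, allSplits, and collect are hypothetical names, and the dictionary is passed in as a Set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not the answer's code.
class DictionarySplitter {
    // Return every split of s whose pieces are all in the dictionary.
    static List<List<String>> allSplits(String s, Set<String> dictionary) {
        List<List<String>> splits = new ArrayList<>();
        collect(s, dictionary, new ArrayList<>(), splits);
        return splits;
    }

    private static void collect(String rest, Set<String> dictionary,
                                List<String> current, List<List<String>> splits) {
        if (rest.isEmpty()) {
            splits.add(new ArrayList<>(current)); // copy, because current is reused
            return;
        }
        // Try each prefix of the remaining text as the next word.
        for (int i = 1; i <= rest.length(); i++) {
            String prefix = rest.substring(0, i);
            if (dictionary.contains(prefix)) {
                current.add(prefix);
                collect(rest.substring(i), dictionary, current, splits);
                current.remove(current.size() - 1); // backtrack
            }
        }
    }
}
```

With the example dictionary, allSplits("iamprogram", dictionary) yields [i, am, pro, gram] and [i, am, program]. Unlike the level-based version, each result here repeats the shared leading words. The number of splits can grow exponentially with the input length, so long inputs may need memoization or a scoring step.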

Also, since no language was tagged, I took the liberty of writing the code in Java-like syntax, since it is easy to understand.