C# - Split an all-uppercase string into separate words (no spaces)

Time: 2017-11-14 02:17:20

Tags: c# string split text-parsing

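I'm currently working on a project where I need to separate individual words out of a string. The problem is that all the words in the string are uppercase and there are no spaces between them. Here is an example of the kind of input the program receives: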

“COMPUTERFIVECODECOLOR”

This should be split into the following result:

"COMPUTER" "FIVE" "CODE" "COLOR"

So far, I have been using the following method to split my strings (it works for every scenario except this edge case):

private static List<string> NormalizeSections(List<string> wordList)
{
    var modifiedList = new List<string>();
    foreach (var word in wordList)
    {
        // Split on capitalized words (an uppercase letter followed by one or
        // more lowercase letters); the capture group keeps the matched words.
        var split = Regex.Split(word, @"(\p{Lu}\p{Ll}+)").ToList();
        split.RemoveAll(i => i == "");

        modifiedList.AddRange(split);
    }
    return modifiedList;
}

If anyone has ideas on how to handle this, I'd be very happy to hear them. Also, let me know if I can provide more information.
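For reference, here is a minimal sketch of why the pattern above handles mixed-case input but not the all-caps case. The class name `RegexSplitDemo` and the helper `SplitWords` are made up for this illustration; only the pattern itself comes from the method above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSplitDemo
{
    // Same pattern as NormalizeSections: one uppercase letter followed by
    // one or more lowercase letters. Empty pieces are dropped.
    public static string[] SplitWords(string word)
    {
        return Regex.Split(word, @"(\p{Lu}\p{Ll}+)")
                    .Where(part => part != "")
                    .ToArray();
    }

    public static void Main()
    {
        // Mixed case: every capitalized word matches and is split out.
        Console.WriteLine(string.Join(" | ", SplitWords("ComputerFiveCodeColor")));

        // All caps: there is no lowercase run, so nothing matches and the
        // input comes back as a single piece.
        Console.WriteLine(string.Join(" | ", SplitWords("COMPUTERFIVECODECOLOR")));
    }
}
```

Since letter case carries no word-boundary information in the all-caps input, a regex alone cannot find the boundaries; some list of valid words is needed.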

2 Answers:

Answer 0 (score: 2)

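My approach is to search for matching words. First, at a given character index, the longest matching word in the dictionary takes priority. Second, if no word is found at a given character index, we move on to the next character and search again.

The implementation below uses a Trie to index the dictionary of all valid words. Rather than iterating over every word in the dictionary, we iterate over each character of the input string, looking for the longest word.

I lifted the C# Trie implementation from this very handy answer: https://stackoverflow.com/a/6073004

Edit: Fixed a bug in the Trie when adding a word that is a substring of an existing word, e.g. Emergency and then Emerge.

The code is available on DotNetFiddle.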

using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {

        var words = new[] { "COMPUTE", "FIVE", "CODE", "COLOR", "PUT", "EMERGENCY", "MERGE", "EMERGE" };

        var trie = new Trie(words);

        var input = "COMPUTEEMERGEFIVECODECOLOR";

        for (var charIndex = 0; charIndex < input.Length; charIndex++)
        {
            var longestWord = FindLongestWord(trie.Root, input, charIndex);

            if (longestWord == null)
            {
                Console.WriteLine("No word found at char index {0}", charIndex);
            }
            else
            {
                Console.WriteLine("Found {0} at char index {1}", longestWord, charIndex);

                charIndex += longestWord.Length - 1;
            }
        }

    }

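    // Returns the longest dictionary word that starts at charIndex, or null if none does.
    static private string FindLongestWord(Trie.Node node, string input, int charIndex)
    {
        // Stop when the input is exhausted; otherwise input[charIndex] would throw.
        if (charIndex >= input.Length) return null;

        var character = char.ToUpper(input[charIndex]);

        string longestWord = null;

        foreach (var edge in node.Edges)
        {
            if (edge.Key.ToChar() == character)
            {
                // Word is null unless a dictionary word ends at this node.
                var foundWord = edge.Value.Word;

                if (!edge.Value.IsTerminal)
                {
                    var longerWord = FindLongestWord(edge.Value, input, charIndex + 1);

                    if (longerWord != null) foundWord = longerWord;
                }

                // Compare foundWord itself; edge.Value.Word can be null here.
                if (foundWord != null && (longestWord == null || foundWord.Length > longestWord.Length))
                {
                    longestWord = foundWord;
                }
            }
        }

        return longestWord;
    }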
}

//Trie taken from: https://stackoverflow.com/a/6073004
public struct Letter
{
    public const string Chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    public static implicit operator Letter(char c)
    {
        return new Letter() { Index = Chars.IndexOf(c) };
    }
    public int Index;
    public char ToChar()
    {
        return Chars[Index];
    }
    public override string ToString()
    {
        return Chars[Index].ToString();
    }
}

public class Trie
{
    public class Node
    {
        public string Word;
        public bool IsTerminal { get { return Edges.Count == 0 && Word != null; } }
        public Dictionary<Letter, Node> Edges = new Dictionary<Letter, Node>();
    }

    public Node Root = new Node();

    public Trie(string[] words)
    {
        for (int w = 0; w < words.Length; w++)
        {
            var word = words[w];
            var node = Root;
            for (int len = 1; len <= word.Length; len++)
            {
                var letter = word[len - 1];
                Node next;
                if (!node.Edges.TryGetValue(letter, out next))
                {
                    next = new Node();

                    node.Edges.Add(letter, next);
                }

                if (len == word.Length)
                {
                    next.Word = word;
                }

                node = next;
            }
        }
    }

}

The output is:

Found COMPUTE at char index 0
Found EMERGE at char index 7
Found FIVE at char index 13
Found CODE at char index 17
Found COLOR at char index 21
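If you want the words back as a list rather than printed, the same longest-match idea can be sketched more compactly. This is only an illustration under a few assumptions: the class `LongestMatchSketch`, its private `Node` type, and the `Segment` method are hypothetical names (not the Trie class above), and the input is assumed to be uppercase already.

```csharp
using System;
using System.Collections.Generic;

public static class LongestMatchSketch
{
    // Hypothetical node type for this sketch.
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public bool EndsWord; // true if a dictionary word ends at this node
    }

    private static Node Build(IEnumerable<string> words)
    {
        var root = new Node();
        foreach (var word in words)
        {
            var node = root;
            foreach (var c in word)
            {
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next.Add(c, child);
                }
                node = child;
            }
            node.EndsWord = true;
        }
        return root;
    }

    // Scans left to right, always taking the longest dictionary word that
    // starts at the current position. Positions where no word starts are skipped.
    public static List<string> Segment(string input, IEnumerable<string> words)
    {
        var root = Build(words);
        var result = new List<string>();
        var start = 0;
        while (start < input.Length)
        {
            var node = root;
            var bestLength = 0;
            for (var i = start; i < input.Length; i++)
            {
                Node child;
                if (!node.Next.TryGetValue(input[i], out child)) break;
                node = child;
                if (node.EndsWord) bestLength = i - start + 1;
            }

            if (bestLength == 0)
            {
                start++; // no word starts here; move on one character
                continue;
            }

            result.Add(input.Substring(start, bestLength));
            start += bestLength;
        }
        return result;
    }

    public static void Main()
    {
        var words = new[] { "COMPUTE", "FIVE", "CODE", "COLOR", "PUT", "EMERGENCY", "MERGE", "EMERGE" };
        // Prints: COMPUTE EMERGE FIVE CODE COLOR
        Console.WriteLine(string.Join(" ", Segment("COMPUTEEMERGEFIVECODECOLOR", words)));
    }
}
```

Walking the trie iteratively and remembering the last position where a word ended avoids the recursion, and it cannot read past the end of the input.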

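Answer 1 (score: 1)

Assuming the words in the dictionary do not contain one another (e.g. "TOO" and "TOOK"), I don't see why this problem needs a solution any more complicated than this one-line function: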

static public List<string> Normalize(string input, List<string> dictionary)
{
    return dictionary.Where(a => input.Contains(a)).ToList();       
}

(If they DO contain one another, see below.)

Full example:

using System;
using System.Linq;
using System.Collections.Generic;

public class Program
{
    static public List<string> Normalize(string input, List<string> dictionary)
    {
        return dictionary.Where(a => input.Contains(a)).ToList();       
    }

    public static void Main()
    {
        List<string> dictionary = new List<string>
        {
            "COMPUTER","FIVE","CODE","COLOR","FOO"
        };
        string input = "COMPUTERFIVECODECOLORBAR";
        var normalized = Normalize(input, dictionary);
        foreach (var s in normalized)
        {
            Console.WriteLine(s);
        }
    }
}

Output:

COMPUTER
FIVE
CODE
COLOR

Code on DotNetFiddle

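On the other hand, if you determine that your keywords do overlap, you're not totally out of luck. If you are sure that the input string contains only words from the dictionary, and that they are contiguous, you can use a more complicated function.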

    static public List<string> Normalize2(string input, List<string> dictionary)
    {
        var sorted = dictionary.OrderByDescending( a => a.Length).ToList();
        var results = new List<string>();
        bool found = false;

        do
        {
            found = false;
            foreach (var s in sorted)
            {
                if (input.StartsWith(s))
                {
                    found = true;
                    results.Add(s);
                    input = input.Substring(s.Length);
                    break;
                }
            }
        }
        while (input != "" && found);

        return results;
    }

    public static void Main()
    {
        List<string> dictionary = new List<string>
        {
            "SHORT","LONG","LONGER","FOO","FOOD"
        };
        string input = "FOODSHORTLONGERFOO";
        var normalized = Normalize2(input, dictionary);
        foreach (var s in normalized)
        {
            Console.WriteLine(s);
        }
    }

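It works by looking only at the beginning of the string and checking for the longest keywords first. Once it finds one, it removes it from the input string and continues searching.

Output: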

FOOD
SHORT
LONGER
FOO

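Note that "LONG" is not included because we included "LONGER", but "FOO" is included because it appears in the string separately from "FOOD".

Also, with the second solution, the keywords appear in the result list in the same order as they appear in the original string. So if the requirement is to actually split the phrase, rather than to detect the keywords in any order, you should use the second function.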
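One caveat with taking the longest keyword first: the greedy choice can paint itself into a corner. With the dictionary "FOO", "FOOD", "DOOR" and the input "FOODOOR", Normalize2 takes "FOOD", is left with "OOR", and stops, even though "FOO" + "DOOR" is a valid split. If that matters for your data, a backtracking variant is one option. The sketch below is an illustration only; `BacktrackingSketch`, `SplitExact`, and `Try` are hypothetical names, and it assumes the whole input must be covered by dictionary words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BacktrackingSketch
{
    // Returns one full segmentation of input into dictionary words, or null
    // if none exists. Longer words are tried first, as in Normalize2, but a
    // choice is undone when the rest of the string cannot be segmented.
    public static List<string> SplitExact(string input, List<string> dictionary)
    {
        var sorted = dictionary.OrderByDescending(w => w.Length).ToList();
        var deadEnds = new HashSet<int>(); // positions already known to fail
        var words = new List<string>();
        return Try(input, 0, sorted, deadEnds, words) ? words : null;
    }

    private static bool Try(string input, int pos, List<string> sorted,
                            HashSet<int> deadEnds, List<string> words)
    {
        if (pos == input.Length) return true;      // consumed everything
        if (deadEnds.Contains(pos)) return false;  // already explored

        foreach (var word in sorted)
        {
            if (word.Length == 0) continue; // an empty word would never advance

            // Ordinal check that input has word at position pos.
            if (string.CompareOrdinal(input, pos, word, 0, word.Length) != 0) continue;

            words.Add(word);
            if (Try(input, pos + word.Length, sorted, deadEnds, words)) return true;
            words.RemoveAt(words.Count - 1); // undo and try the next candidate
        }

        deadEnds.Add(pos);
        return false;
    }

    public static void Main()
    {
        var dictionary = new List<string> { "FOO", "FOOD", "DOOR" };
        var result = SplitExact("FOODOOR", dictionary);
        // Prints: FOO DOOR
        Console.WriteLine(result == null ? "(no split)" : string.Join(" ", result));
    }
}
```

Remembering the failed positions keeps the search from re-exploring the same suffix repeatedly. For the earlier example ("FOODSHORTLONGERFOO") it returns the same four words as Normalize2.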
