Split, group and count String

Asked: 2014-10-28 20:32:08

Tags: c# regex string linq regex-group

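In C#, I want to split a large string into phrases, group identical phrases, and count how often each one occurs.

The following pseudocode should illustrate what I am trying to achieve.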

var my_string = "In the end this is not the end";
my_string.groupCount(2);

==>
    [0] : {Key: "In the", Count:1}
    [1] : {Key: "the end", Count:2}
    [2] : {Key: "end this", Count: 1}
    [3] : {Key: "this is", Count: 1}
    [4] : {Key: "is not", Count: 1}
    [5] : {Key: "not the", Count: 1}

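As you will notice, this is not as simple as splitting the string and counting each substring, because the groups overlap. The example groups every 2 consecutive words, but ideally it should handle any group size.

5 answers:

Answer 0 (score: 1)

Here is an outline of how you could approach this:

  • Use the regular string method Split to get the words
  • Make a dictionary of counts
  • Walk through all pairs of consecutive words, building a composite key and incrementing its count

Here is how this can be implemented: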

var counts = new Dictionary<string,int>();
var tokens = str.Split(' ');
for (var i = 0 ; i < tokens.Length-1 ; i++) {
    var key = tokens[i]+" "+tokens[i+1];
    int c;
    if (!counts.TryGetValue(key, out c)) {
        c = 0;
    }
    counts[key] = c + 1;
}

Demo.
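
The loop above handles pairs only. As a minimal sketch of how the same dictionary approach could be extended to any group size, the key can be built with the `string.Join(separator, values, startIndex, count)` overload. The class name `PhraseCounter` and method name `CountGroups` are made up for illustration, and the code assumes words are separated by single spaces, as in the question's example.

```csharp
using System;
using System.Collections.Generic;

public static class PhraseCounter
{
    // Hypothetical helper: counts every run of `size` consecutive words.
    public static Dictionary<string, int> CountGroups(string text, int size)
    {
        if (size < 1) throw new ArgumentOutOfRangeException("size");

        var counts = new Dictionary<string, int>();
        // Assumption: single-space separators, as in the question's sample string.
        string[] tokens = text.Split(' ');

        for (int i = 0; i + size <= tokens.Length; i++)
        {
            // Joins tokens[i] .. tokens[i + size - 1] into one key.
            string key = string.Join(" ", tokens, i, size);
            int c;
            counts.TryGetValue(key, out c);   // c is 0 when the key is new
            counts[key] = c + 1;
        }
        return counts;
    }
}
```

Calling `PhraseCounter.CountGroups(my_string, 2)` on the question's string should give the same six keys as the expected output, with "the end" counted twice.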

Answer 1 (score: 1)

Here is another approach that uses an ILookup<string, string[]> to count the occurrences of each word array:

var my_string = "In the end this is not the end";
int step = 2;
string[] words = my_string.Split();
var groupWords = new List<string[]>();
for (int i = 0; i + step <= words.Length; i++)
{
    string[] group = new string[step];
    for (int ii = 0; ii < step; ii++)
        group[ii] = words[i + ii];
    groupWords.Add(group);
}
var lookup = groupWords.ToLookup(w => string.Join(" ", w));

foreach(var kv in lookup)
    Console.WriteLine("Key: \"{0}\", Count: {1}", kv.Key, kv.Count()); 

Output:

Key: "In the", Count: 1
Key: "the end", Count: 2
Key: "end this", Count: 1
Key: "this is", Count: 1
Key: "is not", Count: 1
Key: "not the", Count: 1
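
One caveat about the splitting step: `Split()` with no arguments splits on whitespace, but it keeps empty entries, so input with doubled spaces or a trailing space produces empty "words" that end up in the keys. A small sketch of one way to avoid that, assuming that is the behaviour you want (the `WordSplitter` and `SplitWords` names are hypothetical):

```csharp
using System;

public static class WordSplitter
{
    // Hypothetical helper: splits on any whitespace and drops empty entries.
    public static string[] SplitWords(string text)
    {
        // A null separator array makes Split treat white-space characters as delimiters.
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

The resulting array can be used in place of `my_string.Split()` in the code above.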

Answer 2 (score: 0)

Here is my implementation. I have updated it to move the work into a function and to let you specify an arbitrary group size.

public static Dictionary<string,int> groupCount(string str, int groupSize) 
{
    string[] tokens = str.Split(new char[] { ' ' });

    var dict = new Dictionary<string,int>();
    for ( int i = 0; i < tokens.Length - (groupSize-1); i++ ) 
    {
        string key = "";
        for ( int j = 0; j < groupSize; j++ ) 
        {
            key += tokens[i+j] + " ";
        }
        key = key.Substring(0, key.Length-1);

        if ( dict.ContainsKey(key) ) {
            dict[key]++;
        } else {
            dict[key] = 1;
        }
    }

    return dict;
}

Use it like this:

string str = "In the end this is not the end";
int groupSize = 2;
var dict = groupCount(str, groupSize);

Console.WriteLine("Group Of {0}:", groupSize);
foreach ( string k in dict.Keys ) {
    Console.WriteLine("Key: \"{0}\", Count: {1}", k, dict[k]);
}

.NET Fiddle

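Answer 3 (score: 0)

You can create a method that builds phrases from the given words. It is not very efficient (because of Skip), but the implementation is simple: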

private static IEnumerable<string> CreatePhrases(string[] words, int wordsCount)
{
    for(int i = 0; i <= words.Length - wordsCount; i++)
        yield return String.Join(" ", words.Skip(i).Take(wordsCount));
}

The rest is simple: split the string into words, build the phrases, and count the occurrences of each phrase in the original string:

var my_string = "In the end this is not the end";
var words = my_string.Split();
var result = from p in CreatePhrases(words, 2)
             group p by p into g
             select new { g.Key, Count = g.Count()};

Result:

[
   Key: "In the", Count: 1,
   Key: "the end", Count: 2,
   Key: "end this", Count: 1,
   Key: "this is", Count: 1,
   Key: "is not", Count: 1,
   Key: "not the", Count: 1
]

A more efficient way to create groups of consecutive items (works with any IEnumerable):

public static IEnumerable<IEnumerable<T>> ToConsecutiveGroups<T>(
    this IEnumerable<T> source, int size)
{
    // You can check arguments here            
    Queue<T> bucket = new Queue<T>();

    foreach(var item in source)
    {
        bucket.Enqueue(item);
        if (bucket.Count == size)
        {
            yield return bucket.ToArray();
            bucket.Dequeue();
        }
    }
}
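
Regarding the "check arguments here" comment: because the method is an iterator, any checks placed inside it run only when enumeration starts, not when the method is called. A common pattern is to validate in a non-iterator wrapper and delegate to a private iterator. The sketch below shows that pattern under a hypothetical name, `ToConsecutiveGroupsChecked`; it is an illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;

public static class ConsecutiveGroupExtensions
{
    // Non-iterator wrapper: these checks run as soon as the method is called.
    public static IEnumerable<T[]> ToConsecutiveGroupsChecked<T>(
        this IEnumerable<T> source, int size)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (size < 1) throw new ArgumentOutOfRangeException("size");
        return Iterate(source, size);
    }

    // Same sliding-window logic as above, kept in a private iterator.
    private static IEnumerable<T[]> Iterate<T>(IEnumerable<T> source, int size)
    {
        var bucket = new Queue<T>();
        foreach (var item in source)
        {
            bucket.Enqueue(item);
            if (bucket.Count == size)
            {
                yield return bucket.ToArray();   // snapshot of the current window
                bucket.Dequeue();                // slide the window by one item
            }
        }
    }
}
```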

The whole calculation can then be done in a single statement:

var my_string = "In the end this is not the end";
var result = my_string.Split()
               .ToConsecutiveGroups(2)
               .Select(words => String.Join(" ", words))
               .GroupBy(p => p)
               .Select(g => new { g.Key, Count = g.Count()});

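Answer 4 (score: 0)

Assuming you need to handle LARGE strings, I would not recommend splitting the whole string. Instead, walk through it once, remembering where the last groupCount words start and counting each combination in a dictionary: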

    var my_string = "In the end this is not the end";

    var groupCount = 2;

    var groups = new Dictionary<string, int>();
    // Start indexes of the most recent words (at most groupCount of them).
    // Assumes words are separated by single spaces.
    var lastGroupCountWordIndexes = new Queue<int>();
    lastGroupCountWordIndexes.Enqueue(0);

    for (int i = 0; i <= my_string.Length; i++)
    {
        // A word ends at a space or at the end of the string.
        if (i < my_string.Length && my_string[i] != ' ')
        {
            continue;
        }

        if (lastGroupCountWordIndexes.Count == groupCount)
        {
            var firstWordInGroupIndex = lastGroupCountWordIndexes.Dequeue();

            var groupKey = my_string.Substring(firstWordInGroupIndex, i - firstWordInGroupIndex);

            if (!groups.ContainsKey(groupKey))
            {
                groups.Add(groupKey, 1);
            }
            else
            {
                groups[groupKey]++;
            }
        }

        // The next word starts right after this space.
        lastGroupCountWordIndexes.Enqueue(i + 1);
    }