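I want to split a large string in C#, then group and count the occurrences of specific phrases in it.
The following pseudocode should illustrate what I'm trying to achieve.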
var my_string = "In the end this is not the end";
my_string.groupCount(2);
==>
[0] : {Key: "In the", Count:1}
[1] : {Key: "the end", Count:2}
[2] : {Key: "end this", Count: 1}
[3] : {Key: "this is", Count: 1}
[4] : {Key: "is not", Count: 1}
[5] : {Key: "not the", Count: 1}
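As you will notice, this is not as simple as splitting the string and counting each substring. The example groups every 2 words, but ideally it should be able to handle any number.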
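Answer 0 (score: 1)
Here is an outline of how you could approach this: use the regular Split method of string to obtain the words, then count each pair of consecutive words in a dictionary. Here is how you can implement it: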
var counts = new Dictionary<string,int>();
var tokens = str.Split(' ');
for (var i = 0 ; i < tokens.Length-1 ; i++) {
    var key = tokens[i]+" "+tokens[i+1];
    int c;
    if (!counts.TryGetValue(key, out c)) {
        c = 0;
    }
    counts[key] = c + 1;
}
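One caveat with Split(' '): two spaces in a row produce an empty token, which then ends up inside a key. A minimal sketch of one way around that, assuming runs of whitespace should simply be ignored; PairCounter and CountPairs are made-up names for illustration, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

public static class PairCounter
{
    // Hypothetical helper: counts pairs of consecutive words, ignoring repeated whitespace.
    public static Dictionary<string, int> CountPairs(string str)
    {
        var counts = new Dictionary<string, int>();
        // A null separator array means "split on whitespace";
        // RemoveEmptyEntries drops the empty tokens produced by repeated separators.
        var tokens = str.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        for (var i = 0; i < tokens.Length - 1; i++)
        {
            var key = tokens[i] + " " + tokens[i + 1];
            int c;
            counts.TryGetValue(key, out c); // c is 0 when the key is not present yet
            counts[key] = c + 1;
        }
        return counts;
    }
}
```

Calling PairCounter.CountPairs("In  the end this is not the end") (note the double space) should then give the same pairs as the single-spaced sentence.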
Answer 1 (score: 1)
Here is another approach that uses an ILookup<string, string[]> to count the occurrences of each word group:
var my_string = "In the end this is not the end";
int step = 2;
string[] words = my_string.Split();
var groupWords = new List<string[]>();
for (int i = 0; i + step <= words.Length; i++)
{
    string[] group = new string[step];
    for (int ii = 0; ii < step; ii++)
        group[ii] = words[i + ii];
    groupWords.Add(group);
}
var lookup = groupWords.ToLookup(w => string.Join(" ", w));
foreach(var kv in lookup)
    Console.WriteLine("Key: \"{0}\", Count: {1}", kv.Key, kv.Count());
Output:
Key: "In the", Count: 1
Key: "the end", Count: 2
Key: "end this", Count: 1
Key: "this is", Count: 1
Key: "is not", Count: 1
Key: "not the", Count: 1
Answer 2 (score: 0)
Here is my implementation. I have updated it to move the work into a function and to let you specify an arbitrary group size.
public static Dictionary<string,int> groupCount(string str, int groupSize)
{
    string[] tokens = str.Split(new char[] { ' ' });
    var dict = new Dictionary<string,int>();
    for ( int i = 0; i < tokens.Length - (groupSize-1); i++ )
    {
        string key = "";
        for ( int j = 0; j < groupSize; j++ )
        {
            key += tokens[i+j] + " ";
        }
        key = key.Substring(0, key.Length-1);
        if ( dict.ContainsKey(key) ) {
            dict[key]++;
        } else {
            dict[key] = 1;
        }
    }
    return dict;
}
Use it like this:
string str = "In the end this is not the end";
int groupSize = 2;
var dict = groupCount(str, groupSize);
Console.WriteLine("Group Of {0}:", groupSize);
foreach ( string k in dict.Keys ) {
    Console.WriteLine("Key: \"{0}\", Count: {1}", k, dict[k]);
}
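The inner loop that concatenates tokens and then trims the trailing space can probably be replaced by the string.Join overload that takes a start index and a count. A rough sketch of that variant; the GroupCounter class name and the argument check are my additions, not part of the answer.

```csharp
using System;
using System.Collections.Generic;

public static class GroupCounter
{
    // Variant of groupCount that builds each key with string.Join instead of a loop.
    public static Dictionary<string, int> GroupCount(string str, int groupSize)
    {
        if (groupSize < 1)
            throw new ArgumentOutOfRangeException("groupSize");

        string[] tokens = str.Split(' ');
        var dict = new Dictionary<string, int>();
        for (int i = 0; i + groupSize <= tokens.Length; i++)
        {
            // Join(separator, values, startIndex, count) joins tokens[i] .. tokens[i + groupSize - 1]
            string key = string.Join(" ", tokens, i, groupSize);
            int count;
            dict.TryGetValue(key, out count); // count is 0 for a new key
            dict[key] = count + 1;
        }
        return dict;
    }
}
```

This avoids the repeated string concatenation and the Substring call, while the counting logic stays the same.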
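Answer 3 (score: 0)
You can create a method that builds phrases from the given words. It is not very efficient (because of Skip), but the implementation is simple: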
private static IEnumerable<string> CreatePhrases(string[] words, int wordsCount)
{
    for(int i = 0; i <= words.Length - wordsCount; i++)
        yield return String.Join(" ", words.Skip(i).Take(wordsCount));
}
The rest is simple: split the string into words, build the phrases, and count the occurrences of each phrase in the original string:
var my_string = "In the end this is not the end";
var words = my_string.Split();
var result = from p in CreatePhrases(words, 2)
             group p by p into g
             select new { g.Key, Count = g.Count()};
Result:
[
Key: "In the", Count: 1,
Key: "the end", Count: 2,
Key: "end this", Count: 1,
Key: "this is", Count: 1,
Key: "is not", Count: 1,
Key: "not the", Count: 1
]
A more efficient way to create groups of consecutive items (works with any IEnumerable):
public static IEnumerable<IEnumerable<T>> ToConsecutiveGroups<T>(
    this IEnumerable<T> source, int size)
{
    // You can check arguments here
    Queue<T> bucket = new Queue<T>();
    foreach(var item in source)
    {
        bucket.Enqueue(item);
        if (bucket.Count == size)
        {
            yield return bucket.ToArray();
            bucket.Dequeue();
        }
    }
}
The whole calculation can then be done in a single statement:
var my_string = "In the end this is not the end";
var result = my_string.Split()
    .ToConsecutiveGroups(2)
    .Select(words => String.Join(" ", words))
    .GroupBy(p => p)
    .Select(g => new { g.Key, Count = g.Count()});
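Note that the query is deferred and yields anonymous objects, which is awkward if you want to return the counts from a method or sort them. Below is a self-contained sketch of one way to do that, assuming you want the most frequent phrases first. PhraseCounting, Windows and CountPhrases are hypothetical names; Windows repeats the sliding-window idea only so that the sample compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PhraseCounting
{
    // Sliding window over any sequence: yields every run of `size` consecutive items.
    public static IEnumerable<T[]> Windows<T>(this IEnumerable<T> source, int size)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (size < 1) throw new ArgumentOutOfRangeException("size");
        return WindowsIterator(source, size);
    }

    private static IEnumerable<T[]> WindowsIterator<T>(IEnumerable<T> source, int size)
    {
        var bucket = new Queue<T>();
        foreach (var item in source)
        {
            bucket.Enqueue(item);
            if (bucket.Count == size)
            {
                yield return bucket.ToArray(); // snapshot of the current window
                bucket.Dequeue();              // slide the window by one item
            }
        }
    }

    // Materializes the counts as a list, most frequent phrase first.
    public static List<KeyValuePair<string, int>> CountPhrases(string text, int size)
    {
        return text.Split()
            .Windows(size)
            .Select(words => string.Join(" ", words))
            .GroupBy(p => p)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(kv => kv.Value)
            .ToList();
    }
}
```

The argument checks sit in a non-iterator wrapper so that they run when Windows is called rather than when the result is first enumerated.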
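Answer 4 (score: 0)
Assuming you need to process LARGE strings, I would not recommend splitting the whole string. You need to walk through it, remembering the last groupCount words and counting the combinations in a dictionary: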
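var my_string = "In the end this is not the end";
var groupCount = 2;
var groups = new Dictionary<string, int>();
// Start indexes of the most recent words (at most groupCount of them)
var lastGroupCountWordIndexes = new Queue<int>();
lastGroupCountWordIndexes.Enqueue(0);
// Run one step past the end so that the final word closes a group as well
for (int i = 0; i <= my_string.Length; i++)
{
    // A word ends at a space or at the end of the string (assumes single spaces between words)
    if (i < my_string.Length && my_string[i] != ' ')
    {
        continue;
    }
    if (lastGroupCountWordIndexes.Count == groupCount)
    {
        var firstWordInGroupIndex = lastGroupCountWordIndexes.Dequeue();
        var groupKey = my_string.Substring(firstWordInGroupIndex, i - firstWordInGroupIndex);
        if (!groups.ContainsKey(groupKey))
        {
            groups.Add(groupKey, 1);
        }
        else
        {
            groups[groupKey]++;
        }
    }
    // The next word starts right after this space
    lastGroupCountWordIndexes.Enqueue(i + 1);
}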
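If the text is too large to hold comfortably as a single string, the same idea can be applied to a stream. A rough sketch, assuming the input arrives through a TextReader and words are separated by any whitespace; StreamingGroupCounter and CountGroups are hypothetical names, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class StreamingGroupCounter
{
    // Reads one character at a time and keeps only the last groupSize words in memory.
    public static Dictionary<string, int> CountGroups(TextReader reader, int groupSize)
    {
        if (groupSize < 1)
            throw new ArgumentOutOfRangeException("groupSize");

        var counts = new Dictionary<string, int>();
        var window = new Queue<string>();   // the most recent words, oldest first
        var word = new StringBuilder();     // the word currently being read
        int ch;
        do
        {
            ch = reader.Read();             // returns -1 at the end of the input
            bool isSeparator = ch == -1 || char.IsWhiteSpace((char)ch);
            if (!isSeparator)
            {
                word.Append((char)ch);
                continue;
            }
            if (word.Length == 0)
                continue;                   // skip repeated whitespace

            window.Enqueue(word.ToString());
            word.Clear();
            if (window.Count == groupSize)
            {
                string key = string.Join(" ", window);
                int c;
                counts.TryGetValue(key, out c);
                counts[key] = c + 1;
                window.Dequeue();           // slide the window by one word
            }
        } while (ch != -1);
        return counts;
    }
}
```

For a quick check it can be fed a StringReader, for example StreamingGroupCounter.CountGroups(new StringReader(my_string), 2); for a file, File.OpenText would provide the reader.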