I have a string:
var text = @"
I have a long string with a load of words,
and it includes new lines and non-letter characters.
I want to remove all of them and split this text to have one word per line, then I can count how many of each word exist.";
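What is the best way to remove all non-letter characters and then split each word onto a new line, so that I can store the words and count how many of each there are?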
var words = text.Split(' ');
foreach(var word in words)
{
    word.Trim(',','.','-');
}
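I've tried several things, such as using text.Replace(characters) to replace them with whitespace and then splitting. I've also tried Regex (which I'd rather not use).
I also tried using the StringBuilder class, taking characters from the text (string) and appending only the letters a-z / A-Z.
I also tried calling sb.Replace or sb.Remove on the characters I don't want before storing the words in a Dictionary. But I still seem to end up with characters I don't want?
Everything I try seems to leave at least one character I don't want, and I can't figure out why it isn't working.
Thanks!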
Answer 0: (score: 2)
Using an extension method, without RegEx or Linq
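using System;
using System.Collections.Generic;
using System.Text;

static public class StringHelper
{
    static public Dictionary<string, int> CountDistinctWords(this string text)
    {
        string str = text.Replace(Environment.NewLine, " ");
        var words = new Dictionary<string, int>();
        var builder = new StringBuilder();
        char charCurrent;
        // Adds the word currently held in the builder to the dictionary, or increments its count.
        Action processBuilder = () =>
        {
            var word = builder.ToString();
            if ( !string.IsNullOrEmpty(word) )
                if ( !words.ContainsKey(word) )
                    words.Add(word, 1);
                else
                    words[word]++;
        };
        for ( int index = 0; index < str.Length; index++ )
        {
            charCurrent = str[index];
            if ( char.IsLetter(charCurrent) )
                builder.Append(charCurrent);
            // Any character that is neither a letter nor a digit acts as a word separator.
            else if ( !char.IsNumber(charCurrent) )
                charCurrent = ' ';
            // Digits are neither appended nor treated as separators, so they are simply dropped.
            if ( char.IsWhiteSpace(charCurrent) )
            {
                processBuilder();
                builder.Clear();
            }
        }
        // Flush the last word in case the text does not end with a separator.
        processBuilder();
        return words;
    }
}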
It parses every character, rejecting all non-letters, while building a dictionary that holds each word and its number of occurrences.
Test
var result = text.CountDistinctWords();
// Count is a property of Dictionary; calling Count() would pull in Linq.
Console.WriteLine($"Found {result.Count} distinct words:");
Console.WriteLine();
foreach ( var item in result )
    Console.WriteLine($"{item.Key}: {item.Value}");
Sample output
Found 36 distinct words:
I: 3
have: 2
a: 2
long: 1
string: 1
with: 1
load: 1
of: 3
words: 1
and: 3
it: 1
includes: 1
new: 1
lines: 1
non: 1
letter: 1
characters: 1
want: 1
to: 2
remove: 1
all: 1
them: 1
split: 1
this: 1
text: 1
one: 1
word: 2
per: 1
line: 1
then: 1
can: 1
count: 1
how: 1
many: 1
each: 1
exist: 1
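As for why the Split/Trim loop in the question leaves punctuation behind: strings are immutable in .NET, so word.Trim(...) returns a new string, and the loop throws that result away. Splitting on ' ' alone also leaves the newlines attached to the neighbouring words. Below is a minimal sketch of the same idea with the trimmed value kept; the names TrimSketch and SplitAndTrim are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class TrimSketch
{
    // Hypothetical helper: the question's Split + Trim idea, with the trimmed value kept.
    public static List<string> SplitAndTrim(string text)
    {
        var result = new List<string>();
        // Split on newlines and tabs as well, not only on ' '.
        var pieces = text.Split(new[] { ' ', '\n', '\r', '\t' },
                                StringSplitOptions.RemoveEmptyEntries);
        foreach (var piece in pieces)
        {
            // Trim returns a new string; the original piece is left unchanged.
            var word = piece.Trim(',', '.', '-');
            if (word.Length > 0)
                result.Add(word);
        }
        return result;
    }
}
```

Note that Trim only strips characters from the two ends, so "non-letter" stays a single entry here, whereas the extension method above splits it into "non" and "letter".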
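Answer 1: (score: 0)
I do think that, in terms of both performance and clarity, using a dictionary to count frequencies is the best practice. Here is my version, which differs slightly from the accepted answer (I use String.Split() instead of iterating over the characters of the string):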
var text = @"
I have a long string with a load of words,
and it includes new lines and non-letter characters.
I want to remove all of them and split this text to have one word per line, then I can count how many of each word exist.";
var words = text.Split(new [] {',', '.', '-', ' ', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
var freqByWord = new Dictionary<string, int>();
foreach (var word in words)
{
    if (freqByWord.ContainsKey(word))
    {
        freqByWord[word]++; // we found the same word
    }
    else
    {
        freqByWord.Add(word, 1); // we don't have this one yet
    }
}
foreach (var word in freqByWord.Keys)
{
    Console.WriteLine($"{word}: {freqByWord[word]}");
}
The result is nearly identical:
I: 3
have: 2
a: 2
long: 1
string: 1
with: 1
load: 1
of: 3
words: 1
and: 3
it: 1
includes: 1
new: 1
lines: 1
non: 1
letter: 1
characters: 1
want: 1
to: 2
remove: 1
all: 1
them: 1
split: 1
this: 1
text: 1
one: 1
word: 2
per: 1
line: 1
then: 1
can: 1
count: 1
how: 1
many: 1
each: 1
exist: 1
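The ContainsKey check followed by the indexer looks the word up more than once per iteration. A sketch of the same counting loop using TryGetValue, which avoids the separate ContainsKey lookup, might look like this; the names WordFrequency and Count are made up for illustration, and the separator list is copied from the code above.

```csharp
using System;
using System.Collections.Generic;

public static class WordFrequency
{
    // Same separators as in the answer above; any other punctuation stays attached to words.
    private static readonly char[] Separators = { ',', '.', '-', ' ', '\n', '\r' };

    // Hypothetical helper: counts how often each word occurs in the text.
    public static Dictionary<string, int> Count(string text)
    {
        var freqByWord = new Dictionary<string, int>();
        foreach (var word in text.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // When the word is not present yet, count is set to 0 (default(int)).
            freqByWord.TryGetValue(word, out var count);
            freqByWord[word] = count + 1;
        }
        return freqByWord;
    }
}
```

For a text of this size the difference is unlikely to be measurable, so this is mostly a matter of taste.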
Answer 2: (score: -1)
Use a regular expression to exclude the non-letter characters. This also gives you a collection of all the words.
var text = @"
I have a long string with a load of words,
and it includes new lines and non-letter characters.
I want to remove all of them and split this text to have one word per line, then I can count how many of each word exist.";
// Requires: using System.Linq; using System.Text.RegularExpressions;
var words = Regex.Matches(text, @"[A-Za-z ]+").Cast<Match>().SelectMany(n => n.Value.Trim().Split(' '));
int wordCount = words.Count();
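Because the pattern above includes a space, a single match can span several words and has to be split again, and consecutive spaces would yield empty entries. A sketch of a variant that matches one run of letters at a time and counts per word (which is what the question asks for) is shown below. It assumes only ASCII letters count as word characters, and the names RegexWordCount and Count are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexWordCount
{
    // Hypothetical helper: counts occurrences of each run of ASCII letters.
    public static Dictionary<string, int> Count(string text)
    {
        var counts = new Dictionary<string, int>();
        // Each match is a single run of letters, i.e. one word; no second split is needed.
        foreach (Match match in Regex.Matches(text, "[A-Za-z]+"))
        {
            // n is 0 when the word has not been seen yet.
            counts.TryGetValue(match.Value, out var n);
            counts[match.Value] = n + 1;
        }
        return counts;
    }
}
```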