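I have arbitrary text in a txt file, and I need to find the 10 most frequently used words. How should I go about it? I think I should read the text sentence by sentence (from period to period) and put the sentences into an array, but I don't know how to do that.
Answer 0 (score: 9)
You can do this with LINQ. Try something like this: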
var words = "two one three one three one";
var orderedWords = words
    .Split(' ')
    .GroupBy(x => x)
    .Select(x => new {
        KeyField = x.Key,
        Count = x.Count() })
    .OrderByDescending(x => x.Count)
    .Take(10);
Answer 1 (score: 2)
Convert all the data to a string and split it into an array.
Example:
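// Requires: using System; using System.Collections.Generic; using System.Linq;
char[] delimiterChars = { ' ', ',', '.', ':', '\t' };
string text = "one\ttwo three:four,five six seven";
// RemoveEmptyEntries drops the empty strings produced by adjacent delimiters
string[] words = text.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);
var dict = new Dictionary<string, int>();
foreach (var value in words)
{
    if (dict.ContainsKey(value))
        dict[value]++;
    else
        dict[value] = 1;
}
// A Dictionary cannot be indexed by position, so enumerate its entries,
// sorted by count (largest first) and limited to 10
foreach (var pair in dict.OrderByDescending(p => p.Value).Take(10))
{
    Console.WriteLine(pair.Key + ": " + pair.Value);
}
The entries have to be sorted by count, largest first, before you print the top 10.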
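Answer 2 (score: 1)
The hardest part of this task is splitting the initial text into words. In a natural language (English, for example) the notion of a word is quite complex: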
Forget-me-not // 1 word (a nice blue flower)
Do not Forget me! // 4 words
Cannot // 1 word or shall we split "cannot" into "can" + "not"?
May not // 2 words
George W. Bush // Is "W" a word?
W.A.S.P. // ...If it is, is it equal to "W" in the "W.A.S.P"?
Donald Trump // Homonyms: name
Spades is a trump // ...and a special follow in a game of cards
It's an IT; it is // "It" and "IT" are different (IT is an acronym), "It" and "it" are same
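Another issue: you probably want to treat It and it as the same word, but IT as a different one (an acronym). As a first attempt, I suggest something like this: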
// Requires: using System.Globalization; using System.IO; using System.Linq; using System.Text.RegularExpressions;
var top10words = File
    .ReadLines(@"C:\MyFile.txt")
    .SelectMany(line => Regex
        .Matches(line, @"[A-Za-z-']+")   // match against the current line
        .OfType<Match>()
        .Select(match => CultureInfo.InvariantCulture.TextInfo.ToTitleCase(match.Value)))
    .GroupBy(word => word)
    .Select(chunk => new {
        word = chunk.Key,
        count = chunk.Count()})
    .OrderByDescending(item => item.count)
    .ThenBy(item => item.word)
    .Take(10);
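To see what the TitleCase normalization in the query above does to the three spellings under discussion, here is a minimal sketch. The class name TitleCaseDemo and the helper Normalize are made up for illustration; only the ToTitleCase call comes from the answer.

```csharp
using System;
using System.Globalization;

public static class TitleCaseDemo
{
    // Hypothetical helper: wraps the same call the query uses.
    public static string Normalize(string word) =>
        CultureInfo.InvariantCulture.TextInfo.ToTitleCase(word);

    public static void Main()
    {
        // "it" and "It" both become "It"; ToTitleCase leaves the
        // all-uppercase "IT" unchanged, so it stays a separate key.
        foreach (var word in new[] { "it", "It", "IT" })
            Console.WriteLine($"{word} -> {Normalize(word)}");
    }
}
```

Because all-uppercase words are left alone, any word typed entirely in capitals is counted separately from its lowercase form, whether or not it is really an acronym.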
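In my solution I've assumed that:
Words consist only of the letters A..Z, a..z, - (dash) and ' (apostrophe).
TitleCase is used to separate all-capital acronyms from regular words (It and it are treated as the same word, while IT is treated as a different one).
Answer 3 (score: 0)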
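Here is a combined method I wrote based on the answers given by Aldi Renaldi Gunawan and JanneP. I suppose the delimiter characters depend on your use case. You can pass 10 as the numWords argument.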
public static Dictionary<string, int> WordCount(string text, int numWords = int.MaxValue)
{
    var delimiterChars = new char[] { ' ', ',', ':', '\t', '\"', '\r', '{', '}', '[', ']', '=', '/' };
    return text
        .Split(delimiterChars)
        .Where(x => x.Length > 0)
        .Select(x => x.ToLower())
        .GroupBy(x => x)
        .Select(x => new { Word = x.Key, Count = x.Count() })
        .OrderByDescending(x => x.Count)
        .Take(numWords)
        .ToDictionary(x => x.Word, x => x.Count);
}
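A possible way to call the method, as a rough sketch. The class name WordCountDemo and the sample text are made up for illustration, and WordCount is repeated only so the snippet compiles on its own. For a real file, the text could come from File.ReadAllText instead of a literal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountDemo
{
    // Copy of the WordCount method above, repeated so this sketch is self-contained.
    public static Dictionary<string, int> WordCount(string text, int numWords = int.MaxValue)
    {
        var delimiterChars = new char[] { ' ', ',', ':', '\t', '\"', '\r', '{', '}', '[', ']', '=', '/' };
        return text
            .Split(delimiterChars)
            .Where(x => x.Length > 0)
            .Select(x => x.ToLower())
            .GroupBy(x => x)
            .Select(x => new { Word = x.Key, Count = x.Count() })
            .OrderByDescending(x => x.Count)
            .Take(numWords)
            .ToDictionary(x => x.Word, x => x.Count);
    }

    public static void Main()
    {
        // Made-up sample input; replace with the contents of your file.
        string text = "two one three one three one";
        var top = WordCount(text, 10);

        // Dictionary<TKey, TValue> does not document an enumeration order,
        // so sort again before printing rather than relying on insertion order.
        foreach (var pair in top.OrderByDescending(p => p.Value).ThenBy(p => p.Key))
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

For the sample text this prints one: 3, three: 2 and two: 1, each on its own line. If the order matters to the caller, returning a list of key/value pairs instead of a dictionary would avoid the second sort.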