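I have a text of about 600 words from which I need to remove all quotation marks, numbers (years, dates, ...), digits, and so on. Only the words should remain, and they have to be put into a dictionary.
So I tried to go through the text line by line with a for-each loop, take the first letter, and save it in a list. Then I split each line into individual words. For example: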
You are pretty.
You are pretty
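The problem is that several consecutive words still end up joined together when they should not be. I have tried to fix this but could not find a solution.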
Answer 0 (score: 1)
You can split the text with a regular expression and then use LINQ to build the dictionary:
var dictionary = Regex.Split(text, @"\W+")
    .Where(m => m.Length > 0) // Regex.Split yields "" when the text starts or ends with a separator
    .GroupBy(m => m, StringComparer.OrdinalIgnoreCase) // Case-insensitive
    .ToDictionary(m => m.Key, m => m.Count());
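As a rough, self-contained sketch of the same split-and-count idea: the class name `WordCounter` and the method `CountWords` are hypothetical names chosen for illustration, not part of the original answer. Note that `\W+` only splits on non-word characters, and digits count as word characters, so numeric tokens such as `2018` would still appear as keys here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Hypothetical helper: counts how often each word occurs, ignoring case.
    public static Dictionary<string, int> CountWords(string text)
    {
        var comparer = StringComparer.OrdinalIgnoreCase;
        return Regex.Split(text, @"\W+")
            .Where(w => w.Length > 0)      // drop the empty tokens Regex.Split can produce at the edges
            .GroupBy(w => w, comparer)     // "You" and "you" fall into the same group
            .ToDictionary(g => g.Key, g => g.Count(), comparer);
    }

    public static void Main()
    {
        var counts = CountWords("You are pretty. you ARE here.");
        foreach (var pair in counts)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

Passing the comparer to `ToDictionary` as well means later lookups such as `counts["YOU"]` are also case-insensitive.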
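Update
Applied to your sample code, your task class might end up looking like this, building both dictionaries (and taking case-insensitivity into account):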
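using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public class Aufgabe
{
    const string ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ";
    public Dictionary<string, int> words;               // word -> number of occurrences
    public Dictionary<char, List<string>> firstletter;  // first letter -> words starting with it
    public Aufgabe(string filename)
    {
        var text = File.ReadAllText(filename);
        words = Regex.Split(text, @"\W+")
            .Where(m => m.Length > 0) // skip empty tokens so m[0] below cannot throw
            .GroupBy(m => m, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(m => m.Key, m => m.Count());
        firstletter = ALPHABET.ToDictionary(a => a, // First-letter key
            a => words.Keys.Where(m => a == char.ToUpper(m[0])).ToList()); // Words
    }
}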
Answer 1 (score: 1)
The other answers are missing a few things, for example that spain, Spain and SPAIN should be treated as the same word. My solution:
StringComparer comparer = StringComparer.OrdinalIgnoreCase;
string text = "The 'rain' in spain falls mainly on the plain. 07 November 2018 20:02:07 - 20180520 I said the Plain in SPAIN. 12345";
var dictionary = Regex.Split(text, @"\W+")
    .Where(IsValidWord)
    .GroupBy(m => m, comparer)
    .ToDictionary(m => m.Key, m => m.Count(), comparer);
The IsValidWord method:
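// logic to validate word goes here
private static bool IsValidWord(string text)
{
    // Regex.Split can yield empty tokens at the start or end of the text
    if (string.IsNullOrEmpty(text))
        return false;
    double value;
    bool isNumeric = double.TryParse(text, out value);
    // add more validation rules here
    return !isNumeric;
}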
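A different way to keep numbers out, sketched here as an alternative to filtering after the split, is to match runs of letters directly with `\p{L}+`, so that tokens made of digits never appear in the first place. This is a swapped-in technique, not the method from the answer above, and the names `LetterWords` and `Count` are hypothetical. One caveat under this assumption: a mixed token such as `abc123` would contribute `abc`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LetterWords
{
    // Hypothetical helper: counts words made only of Unicode letters, ignoring case.
    public static Dictionary<string, int> Count(string text)
    {
        var comparer = StringComparer.OrdinalIgnoreCase;
        return Regex.Matches(text, @"\p{L}+")   // each match is a run of letters only
            .Cast<Match>()
            .Select(m => m.Value)
            .GroupBy(w => w, comparer)
            .ToDictionary(g => g.Key, g => g.Count(), comparer);
    }

    public static void Main()
    {
        var text = "The 'rain' in spain falls mainly on the plain. 07 November 2018 20:02:07 - 20180520 I said the Plain in SPAIN. 12345";
        foreach (var pair in Count(text))
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

Because `\p{L}` covers any Unicode letter, words containing characters such as Ä, Ö or Ü are matched as well.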
I noticed in your code that you have a dictionary with words grouped by first letter. That can be done like this (using the previous dictionary):
var lettersDictionary = dictionary.Keys
    .GroupBy(x => x.Substring(0, 1),
        (alphabet, subList) => new
        {
            Alphabet = alphabet,
            SubList = subList.OrderBy(x => x, comparer).ToList()
        },
        comparer) // group "i" and "I" together; otherwise ToDictionary throws on the duplicate key
    .ToDictionary(m => m.Alphabet, m => m.SubList, comparer);
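The first-letter grouping can also be sketched with a `char` key that is normalised to upper case, which avoids the need for a string comparer on the keys. The names `FirstLetterIndex` and `GroupByFirstLetter` are hypothetical, and the sample words in `Main` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FirstLetterIndex
{
    // Hypothetical helper: maps each upper-cased first letter to the sorted words starting with it.
    public static Dictionary<char, List<string>> GroupByFirstLetter(IEnumerable<string> words)
    {
        return words
            .Where(w => !string.IsNullOrEmpty(w))          // w[0] would throw on an empty string
            .GroupBy(w => char.ToUpperInvariant(w[0]))     // 'i' and 'I' share one key
            .ToDictionary(
                g => g.Key,
                g => g.OrderBy(w => w, StringComparer.OrdinalIgnoreCase).ToList());
    }

    public static void Main()
    {
        var index = GroupByFirstLetter(new[] { "in", "I", "spain", "said", "The" });
        foreach (var pair in index)
            Console.WriteLine($"{pair.Key}: {string.Join(", ", pair.Value)}");
    }
}
```

In a real program the input would typically be the keys of the word-count dictionary, for example `GroupByFirstLetter(dictionary.Keys)`.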
Answer 2 (score: 0)
Here is one approach using Regex; note that case sensitivity is not handled:
var text = "The 'rain' in spain falls mainly on the plain. I said the plain in spain";
var result = new Dictionary<string,string>();
Regex.Matches(text, @"[^\s]+")
    .OfType<Match>()
    .Select(m => Regex.Replace(m.Value, @"\W", string.Empty))
    .ToList()
    .ForEach(word =>
    {
        if (!result.ContainsKey(word))
            result.Add(word, word);
    });
Result