Question

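I have a text of about 600 words, and I need to remove all quotation marks, numbers (years, dates, ...), digits, and so on. Only the words should remain, and they have to be put into a dictionary.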

So I tried a foreach loop over the lines, taking the first letter of each and saving it in a list. Then I split each line into words. For example:

You are pretty.

You
are
pretty

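The problem is that the same word still shows up several times in the result, although each word should appear only once. I have tried to fix it but could not find a solution.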


Answer 1

You can split the text with a regular expression and then build the dictionary with LINQ:

var dictionary = Regex.Split(text, @"\W+")
    .Where(m => m.Length > 0) // Skip empty tokens from leading/trailing punctuation
    .GroupBy(m => m, StringComparer.OrdinalIgnoreCase) // Case-insensitive
    .ToDictionary(m => m.Key, m => m.Count());
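
For reference, here is a minimal self-contained sketch of the same idea. The names `WordCounter` and `CountWords` are made up for illustration. It also filters out the empty strings that `Regex.Split` returns when the text starts or ends with a non-word character, and it passes the comparer to `ToDictionary` so that later lookups ignore case too. Note that `\W+` only removes punctuation; numeric tokens would still be counted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Hypothetical helper: counts each word in the text, ignoring case.
    public static Dictionary<string, int> CountWords(string text)
    {
        return Regex.Split(text, @"\W+")
            .Where(w => w.Length > 0) // drop empty tokens from leading/trailing punctuation
            .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);
    }
}
```

Calling `WordCounter.CountWords("You are pretty. you ARE!")` should give three entries, with the first spelling encountered ("You", "are", "pretty") used as the key.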

Update

Applied to your sample code, your task class could look like this, building both dictionaries (and ignoring case):

public class Aufgabe
{
    const string ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ";
    public Dictionary<string, int> words;
    public Dictionary<char, List<string>> firstletter;
    public Aufgabe(string filename)
    {
        var text = File.ReadAllText(filename);
        words = Regex.Split(text, @"\W+")
            .Where(m => m.Length > 0) // Skip empty tokens so m[0] below is safe
            .GroupBy(m => m, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(m => m.Key, m => m.Count());
        firstletter = ALPHABET.ToDictionary(a => a, // First-letter key
            a => words.Keys.Where(m => a == char.ToUpper(m[0])).ToList()); // Words
    }
}

Answer 2

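The other answers are missing a few things:

There is no validation to check whether a token is actually a word
The comparison should be case-insensitive (i.e. spain, Spain and SPAIN should be treated as the same word)

My solution: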

StringComparer comparer = StringComparer.OrdinalIgnoreCase;
string text = "The 'rain' in spain falls mainly on the plain. 07 November 2018 20:02:07 - 20180520 I said the Plain in SPAIN. 12345";

var dictionary = Regex.Split(text, @"\W+")
                      .Where(IsValidWord)
                      .GroupBy(m => m, comparer)
                      .ToDictionary(m => m.Key, m => m.Count(), comparer);

The IsValidWord method:

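// logic to validate word goes here
private static bool IsValidWord(string text)
{
    // Regex.Split can yield empty tokens at the start or end of the text
    if (string.IsNullOrEmpty(text))
        return false;

    double value;

    // Culture-sensitive parse: rejects tokens that are plain numbers
    bool isNumeric = double.TryParse(text, out value);

    // add more validation rules here

    return !isNumeric;
}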
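
As one possible extra rule, a word could be required to consist of letters only. That also rejects mixed tokens such as `2018th`, which `double.TryParse` does not recognise as a number. A sketch, using the hypothetical names `WordRules` and `IsLettersOnly`:

```csharp
using System;
using System.Linq;

public static class WordRules
{
    // Hypothetical stricter rule: non-empty and made of letters only.
    public static bool IsLettersOnly(string text)
    {
        return !string.IsNullOrEmpty(text) && text.All(char.IsLetter);
    }
}
```

It could be used in place of `IsValidWord` in the query above, for example `.Where(WordRules.IsLettersOnly)`. Whether that is too strict (it rejects anything containing a digit or an underscore) depends on what you want to count as a word.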

Edit

I noticed in your code that you have a dictionary with the words grouped by first letter. That can be done like this (using the previous dictionary):

var lettersDictionary = dictionary.Keys.GroupBy(x => x.Substring(0, 1),
        (alphabet, subList) => new {
            Alphabet = alphabet,
            SubList = subList.OrderBy(x => x, comparer).ToList()
        },
        comparer) // Group case-insensitively, otherwise "i" and "I" become duplicate keys below
        .ToDictionary(m => m.Alphabet, m => m.SubList, comparer);
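
If a read-only index is enough, `ToLookup` may be a simpler alternative to the grouped dictionary. The following is only a sketch of that variation; `LetterIndex` and `ByFirstLetter` are hypothetical names, and the key is normalised to upper case instead of relying on a comparer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LetterIndex
{
    // Hypothetical helper: indexes words by their upper-cased first letter.
    public static ILookup<char, string> ByFirstLetter(IEnumerable<string> words)
    {
        return words
            .Where(w => !string.IsNullOrEmpty(w)) // w[0] needs at least one character
            .OrderBy(w => w, StringComparer.OrdinalIgnoreCase)
            .ToLookup(w => char.ToUpperInvariant(w[0]));
    }
}
```

With `var index = LetterIndex.ByFirstLetter(dictionary.Keys);`, `index['S']` returns the words starting with "s" or "S", and a letter with no words returns an empty sequence rather than throwing.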

Answer 3

Here is one approach using Regex. Note that case sensitivity is not handled here.

var text = "The 'rain' in spain falls mainly on the plain. I said the plain in spain";

var result = new Dictionary<string,string>();

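Regex.Matches(text, @"[^\s]+") // Every run of non-whitespace characters
     .OfType<Match>()
     .Select(m => Regex.Replace(m.Value, @"\W", string.Empty)) // Strip quotes and punctuation
     .Where(word => word.Length > 0) // Skip tokens that were punctuation only
     .ToList()
     .ForEach(word =>
     {
        if (!result.ContainsKey(word))
            result.Add(word, word);
     });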
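
If case should be ignored here as well, one option is to collect the cleaned tokens in a `HashSet<string>` created with `StringComparer.OrdinalIgnoreCase`. This is a sketch of a variation, not the code from this answer; `WordSet` and `DistinctWords` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordSet
{
    // Hypothetical helper: returns the distinct words of the text, ignoring case.
    public static HashSet<string> DistinctWords(string text)
    {
        var words = Regex.Matches(text, @"[^\s]+")
            .Cast<Match>()
            .Select(m => Regex.Replace(m.Value, @"\W", string.Empty)) // strip punctuation
            .Where(w => w.Length > 0); // skip tokens that were punctuation only
        return new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);
    }
}
```

The set keeps the first spelling it sees, so "plain" and "Plain" end up as a single entry.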
