Question

I created a dictionary and wrote code that reads a txt file and adds each word in the file to the dictionary.

//Set up OpenFileDialog box, and prompt user to select file to open
DialogResult openFileResult;
OpenFileDialog file = new OpenFileDialog();
file.Filter = "txt files (*.txt)|*.txt";
openFileResult = file.ShowDialog();
if (openFileResult == DialogResult.OK)
{
    //If user selected file successfully opened

    //Reset form
    this.Controls.Clear();
    this.InitializeComponent();

    //Read from file, split into array of words
    Stream fs = file.OpenFile();
    StreamReader reader;
    reader = new StreamReader(fs);
    string line = reader.ReadToEnd();
    string[] words = line.Split(' ', '\n');

    //Add each word and frequency to dictionary
    foreach (string s in words)
    {
        AddToDictionary(s);
    }

    //Reset variables, and set-up chart
    ResetVariables();
    ChartInitialize();

    foreach (string s in wordDictionary.Keys)
    {
        //Calculate statistics from dictionary
        ComputeStatistics(s);
        if (dpCount < 50)
        {
            AddToGraph(s);
        }
    }

    //Print statistics
    PrintStatistics();
}

The AddToDictionary(s) function is:

public void AddToDictionary(string s)
    {
        //Function to add string to dictionary
        string wordLower = s.ToLower();
        if (wordDictionary.ContainsKey(wordLower))
        {
            int wordCount = wordDictionary[wordLower];
            wordDictionary[wordLower] = wordDictionary[wordLower] + 1;
        }
        else
        {
            wordDictionary.Add(wordLower, 1);
            txtUnique.Text += wordLower + ", ";
        }
    }

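The problem I'm running into is that, when the program reads my text file, the word "have" appears in the dictionary twice. I know that isn't supposed to happen in a dictionary, but for some reason it shows up twice. Does anyone know why this happens?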

Answer 1

If you run:

var sb = new StringBuilder();
sb.AppendLine("test which");
sb.AppendLine("is a test");
var words = sb.ToString().Split(' ', '\n').Distinct();

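and inspect words in the debugger, you will see that some instances of "test" have picked up a trailing \r. AppendLine ends each line with Environment.NewLine, which on Windows is the two-character CRLF terminator (\r\n), and the split only handles the \n.

To fix this, change the split to: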

Split(new[] {" ", Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
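
One caveat: Environment.NewLine is "\r\n" on Windows but "\n" on other platforms, so a file with CRLF endings read elsewhere would still leave the \r attached. A platform-independent variant is to list the separator characters explicitly. The following is only a minimal sketch; the sample text and the class name are made up for illustration.

```csharp
using System;
using System.Linq;

class SplitDemo
{
    static void Main()
    {
        // Hypothetical sample text with Windows-style CRLF line endings.
        string text = "I have a test\r\nyou have a test";

        // The split from the question: '\r' is not a separator,
        // so it stays glued to the last word of the first line.
        string[] naive = text.Split(' ', '\n');
        Console.WriteLine(naive.Contains("test\r")); // True

        // Treat space, '\r' and '\n' as separators and drop empty entries.
        string[] words = text.Split(
            new[] { ' ', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);
        Console.WriteLine(words.Distinct().Count()); // 5
    }
}
```

With the explicit separator list, "test" and "test\r" collapse into a single key, which is why the duplicate disappears.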

Answer 2

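Splitting text into words is generally hard to get right if you want to support multiple languages. Regular expressions are usually a better tool for this kind of parsing than a basic String.Split.

In your case, you are picking up variations of "new line" as part of a word, and you could also be picking up things like non-breaking spaces.

The following code picks out words more reliably than your current .Split. For more information, see "How do I split a phrase into words using Regex in C#".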

 var words = Regex.Split(line, @"\W+").ToList();

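In addition, you should make sure your dictionary is case-insensitive (choose the comparer that fits your needs; culture-aware ones are available too):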

 var dictionary = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
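
To show how the two suggestions fit together, here is a rough sketch that counts words with Regex.Split and a case-insensitive dictionary. The input string, the class name, and the variable names are assumptions for illustration, not taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class CountDemo
{
    static void Main()
    {
        // Hypothetical input with mixed case and a CRLF line ending.
        string line = "I have a test\r\nYou HAVE a test";

        // Keys compare case-insensitively, so no ToLower() call is needed.
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        foreach (string word in Regex.Split(line, @"\W+"))
        {
            // Regex.Split yields an empty string when the input
            // starts or ends with a separator; skip those.
            if (word.Length == 0) continue;

            // current stays 0 when the key is not present yet.
            counts.TryGetValue(word, out int current);
            counts[word] = current + 1;
        }

        Console.WriteLine(counts["have"]); // 2
        Console.WriteLine(counts.Count);   // 5
    }
}
```

Note that the dictionary keeps the casing of the first occurrence of each key, so normalize the words yourself if you display the keys and want them all lower case.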

Answer 3

I would be inclined to change the following code:

        //Read from file, split into array of words
        Stream fs = file.OpenFile();
        StreamReader reader;
        reader = new StreamReader(fs);
        string line = reader.ReadToEnd();
        string[] words = line.Split(' ', '\n');

        //Add each word and frequency to dictionary
        foreach (string s in words)
        {
            AddToDictionary(s);
        }

to this:

wordDictionary =
    File
        .ReadAllLines(file.FileName) // file is the OpenFileDialog; ReadAllLines needs the path
        .SelectMany(x => x.Split(new [] { ' ', }, StringSplitOptions.RemoveEmptyEntries))
        .Select(x => x.ToLower())
        .GroupBy(x => x)
        .ToDictionary(x => x.Key, x => x.Count());

This avoids the line-ending problem entirely, and has the added advantage that it doesn't leave any undisposed streams lying around.
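
The reason the line endings stop mattering is that File.ReadAllLines strips the terminator from each line, whether it is \r\n, \n, or \r. A small self-contained sketch of this, using a temporary file with made-up contents in place of the file chosen in the dialog:

```csharp
using System;
using System.IO;
using System.Linq;

class ReadAllLinesDemo
{
    static void Main()
    {
        // Hypothetical stand-in for the user's file, written with CRLF endings.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "I have a test\r\nyou have a test\r\n");

        var wordDictionary = File
            .ReadAllLines(path) // each line comes back without its terminator
            .SelectMany(x => x.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            .Select(x => x.ToLower())
            .GroupBy(x => x)
            .ToDictionary(x => x.Key, x => x.Count());

        Console.WriteLine(wordDictionary["have"]);               // 2
        Console.WriteLine(wordDictionary.ContainsKey("test\r")); // False

        File.Delete(path);
    }
}
```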

C# Dictionary allows seemingly identical keys
