I created a dictionary and wrote code that reads a txt file and adds each word in the file to the dictionary.
//Set up OpenFileDialog box, and prompt user to select file to open
DialogResult openFileResult;
OpenFileDialog file = new OpenFileDialog();
file.Filter = "txt files (*.txt)|*.txt";
openFileResult = file.ShowDialog();
if (openFileResult == DialogResult.OK)
{
    //If user selected file successfully opened
    //Reset form
    this.Controls.Clear();
    this.InitializeComponent();
    //Read from file, split into array of words
    Stream fs = file.OpenFile();
    StreamReader reader;
    reader = new StreamReader(fs);
    string line = reader.ReadToEnd();
    string[] words = line.Split(' ', '\n');
    //Add each word and frequency to dictionary
    foreach (string s in words)
    {
        AddToDictionary(s);
    }
    //Reset variables, and set-up chart
    ResetVariables();
    ChartInitialize();
    foreach (string s in wordDictionary.Keys)
    {
        //Calculate statistics from dictionary
        ComputeStatistics(s);
        if (dpCount < 50)
        {
            AddToGraph(s);
        }
    }
    //Print statistics
    PrintStatistics();
}
The function AddToDictionary(s), which is called for each word in the text file the program reads, is:
public void AddToDictionary(string s)
{
    //Function to add string to dictionary
    string wordLower = s.ToLower();
    if (wordDictionary.ContainsKey(wordLower))
    {
        int wordCount = wordDictionary[wordLower];
        wordDictionary[wordLower] = wordDictionary[wordLower] + 1;
    }
    else
    {
        wordDictionary.Add(wordLower, 1);
        txtUnique.Text += wordLower + ", ";
    }
}
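The problem I'm running into is that the word "have" appears twice in the dictionary. I know a dictionary cannot contain the same key twice, but for some reason it shows up twice. Does anyone know why this happens?
Answer 0 (score: 3)
If you run: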
var sb = new StringBuilder();
sb.AppendLine("test which");
sb.AppendLine("is a test");
var words = sb.ToString().Split(' ', '\n').Distinct();
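and inspect words in the debugger, you will see that, because of the two-character CRLF line terminator (\r\n), some instances of "test" have picked up a trailing \r, which the split does not handle. "test" and "test\r" are different strings, so both survive Distinct() and both become dictionary keys.
To fix this, change the split to: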
Split(new[] {" ", Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
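One caveat with the fix above: Environment.NewLine is "\r\n" on Windows but "\n" on Unix-like systems, so a CRLF file read on another platform would still leave the \r behind. Here is a minimal sketch of a platform-independent variant that lists the separator characters explicitly. The class and method names (CrlfSplitDemo, NaiveSplit, RobustSplit) are made up for illustration and are not from the question.

```csharp
using System;
using System.Linq;

public static class CrlfSplitDemo
{
    // Splits the way the question does: on space and '\n' only,
    // so the '\r' of a CRLF line ending stays attached to the word.
    public static string[] NaiveSplit(string text) =>
        text.Split(' ', '\n');

    // Splits on space, tab, '\r' and '\n', and drops the empty entries
    // that appear between "\r" and "\n" or at the end of the text.
    public static string[] RobustSplit(string text) =>
        text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

    public static void Main()
    {
        string text = "test which\r\nis a test\r\n";

        // "test", "which\r", "is", "a", "test\r", "" -> 6 distinct entries
        Console.WriteLine(NaiveSplit(text).Distinct().Count());

        // "test", "which", "is", "a" -> 4 distinct entries
        Console.WriteLine(RobustSplit(text).Distinct().Count());
    }
}
```

With the naive split, "test" and "test\r" count as two different words, which is the same effect as the duplicated "have" in the question.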
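Answer 1 (score: 0)
Splitting text into words is generally hard to get right if you want to support multiple languages. Regular expressions usually handle this kind of parsing better than a basic String.Split.
I.e. in your case you are picking up the "new line" character as part of a word, and you could also be picking up things like non-breaking spaces.
The following code will pick out words better than your current .Split. For more information, see How do I split a phrase into words using Regex in C#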
var words = Regex.Split(line, @"\W+").ToList();
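Also, you should make sure your dictionary is case-insensitive (pick the comparer that fits your needs; culture-aware comparers exist as well):
var dictionary = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);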
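Putting the two suggestions together, a rough sketch of a word counter could look like the following. The class and method names (RegexWordCount, Count) are hypothetical, and the int value type is assumed from the question's counting code. Note that \W+ also splits on apostrophes, so "don't" becomes "don" and "t"; adjust the pattern if that matters for your text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexWordCount
{
    // Counts words case-insensitively, treating any run of
    // non-word characters (spaces, \r, \n, punctuation) as a separator.
    public static Dictionary<string, int> Count(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        foreach (string word in Regex.Split(text, @"\W+"))
        {
            // Regex.Split yields "" when the text starts or ends with a separator.
            if (word.Length == 0)
            {
                continue;
            }

            // TryGetValue leaves current at 0 when the word is not present yet.
            counts.TryGetValue(word, out int current);
            counts[word] = current + 1;
        }

        return counts;
    }

    public static void Main()
    {
        var counts = Count("Have you?\r\nI have.\r\n");
        Console.WriteLine(counts["have"]); // prints 2
    }
}
```

Because the comparer is passed to the dictionary itself, there is no need to call ToLower() on every word before the lookup.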
Answer 2 (score: 0)
I would be inclined to change the following code:
//Read from file, split into array of words
Stream fs = file.OpenFile();
StreamReader reader;
reader = new StreamReader(fs);
string line = reader.ReadToEnd();
string[] words = line.Split(' ', '\n');
//Add each word and frequency to dictionary
foreach (string s in words)
{
    AddToDictionary(s);
}
to this:
wordDictionary =
    File
        // file is the OpenFileDialog from the question, so pass its FileName (a path string)
        .ReadAllLines(file.FileName)
        .SelectMany(x => x.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        .Select(x => x.ToLower())
        .GroupBy(x => x)
        .ToDictionary(x => x.Key, x => x.Count());
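This avoids the line-ending problem entirely, because File.ReadAllLines strips the line terminators for you, and it has the added advantage that it does not leave any undisposed streams behind.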
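As a variation on the same idea, here is a small self-contained sketch that splits each line on any whitespace (so tabs are handled too) and groups with a case-insensitive comparer instead of calling ToLower(). LinqWordCount and CountWords are illustrative names, and the temporary file exists only to make the example runnable on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LinqWordCount
{
    // Builds a case-insensitive word -> frequency dictionary from lines
    // that have already had their line terminators removed.
    public static Dictionary<string, int> CountWords(IEnumerable<string> lines) =>
        lines
            // A null separator array makes Split treat any whitespace as a delimiter.
            .SelectMany(l => l.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);

    public static void Main()
    {
        // Write a small CRLF-terminated sample file to a temporary location.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "I have\r\nHAVE\tyou\r\n");

            // File.ReadLines opens and closes the file itself, so no stream is left open.
            var counts = CountWords(File.ReadLines(path));
            Console.WriteLine(counts["have"]); // prints 2
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Taking an IEnumerable<string> rather than a file path keeps the counting logic easy to test without touching the disk.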