Counting the number of unique words and the occurrences of each word in a txt file

Time: 2015-09-02 20:22:37

Tags: c# visual-studio text-processing

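I am currently trying to build an application that does some text processing on a text file, using a dictionary to build a word index. The idea is this: the program reads the text file and checks each word to see whether it has already appeared in the file and, if so, what its id is as a unique word. It then prints the index number and the total number of occurrences of every word it encounters, and keeps going until the whole file has been checked. The result should look something like this: http://pastebin.com/CjtcYchF

Here is a sample of the text file I am feeding in: http://pastebin.com/ZRVbhWhV. A quick Ctrl-F shows that "not" occurs 2 times and "that" occurs 4 times. What I need to do is index each word and write it out like this: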

sample input : "that I have not that place sunrise beach like not good dirty beach trash beach"

    dictionary :            output.txt / output.dat:
    index  word
      1    I                4:2 1:1 2:1 3:2 5:1 6:1 7:3 8:1 9:1 10:1 11:1
      2    have
      3    not
      4    that
      5    place
      6    sunrise
      7    beach
      8    like
      9    good
      10   dirty
      11   trash

I have tried to implement some code to build the dictionary. Here is what I have so far:

    private void bagofword_Click(object sender, EventArgs e)
    {
        //creating dictionary in background
        //Dictionary<string, int> dict = new Dictionary<string, int>();
        string rawinputbow = File.ReadAllText(textBox31.Text);
        //string[] inputbow = rawinputbow.Split(' ');

        var inputbow = rawinputbow.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
                                  .ToList();
        var dict = new OrderedDictionary();
        var output = new List<int>();

        foreach (var element in inputbow.Select((word, index) => new { word, index }))
        {
            if (dict.Contains(element.word))
            {
                var count = (int)dict[element.word];
                dict[element.word] = ++count;
                output.Add(GetIndex(dict, element.word));
                //textBoxfile.Text = output.ToString();
                //textBoxfile.Text = inputbow.ToString();
                string result = string.Join(",", output);
                textBoxfile.Text = result.ToString();
            }
            else
            {
                dict[element.word] = 1;
                output.Add(GetIndex(dict, element.word));
                //textBoxfile.Text = dict.ToString();
                string result = string.Join(",", output);
                textBoxfile.Text = result.ToString();
            }
        }
    }

    public int GetIndex(OrderedDictionary dictionary, string key)
    {
        for (int index = 0; index < dictionary.Count; index++)
        {
            if (dictionary[index] == dictionary[key])
                return index; // We found the item
                //textBoxfile.Text = index.ToString();
        }

        return -1;
    }

Does anyone know how to complete this code? Any help is much appreciated!

3 answers:

Answer 0 (score: 2)

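Splitting on spaces is not enough. Your text contains tokens such as "temple,", "photos.cafes/restaraunts" and so on. A better approach is to use a regular expression like \w+. The words should also be compared case-insensitively.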
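To make the difference concrete, here is a minimal sketch that tokenizes the same line both ways. The sample string and the names TokenizeDemo, SplitOnSpaces and SplitOnWordChars are made up for illustration, and StringComparer.OrdinalIgnoreCase is just one possible choice of case-insensitive comparison.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizeDemo
{
    // Naive tokenization: split on the space character only,
    // so punctuation stays attached to the neighbouring word.
    public static string[] SplitOnSpaces(string text) =>
        text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Regex tokenization: every run of word characters (\w+) is one token.
    public static string[] SplitOnWordChars(string text) =>
        Regex.Matches(text, @"\w+").Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // Hypothetical sample line, not taken from the asker's file.
        const string sample = "temple, photos.cafes/restaraunts Temple";

        Console.WriteLine(string.Join(" | ", SplitOnSpaces(sample)));
        Console.WriteLine(string.Join(" | ", SplitOnWordChars(sample)));

        // Case-insensitive distinct count: "temple" and "Temple" collapse into one word.
        Console.WriteLine(
            SplitOnWordChars(sample).Distinct(StringComparer.OrdinalIgnoreCase).Count());
    }
}
```

On this sample the space split yields three tokens with punctuation still attached, while \w+ yields five clean words, four of which are distinct once case is ignored.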

My approach would be:

var words = Regex.Matches(File.ReadAllText(filename), @"\w+").Cast<Match>()
            .Select((m, pos) => new { Word = m.Value, Pos = pos })
            .GroupBy(s => s.Word, StringComparer.CurrentCultureIgnoreCase)
            .Select(g => new { Word = g.Key, PosInText = g.Select(z => z.Pos).ToList() })
            .ToList();


foreach(var item in words)
{
    Console.WriteLine("{0,-15} POS:{1}", item.Word, string.Join(",", item.PosInText));
}


for (int i = 0; i < words.Count; i++)
{
    Console.Write("{0}:{1} ", i, words[i].PosInText.Count);
} 
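If the goal is a single "index:count" line like the one in the question, the same idea can be sketched with a plain Dictionary. This is only an illustration under stated assumptions: ids are 1-based and assigned in order of first appearance (so "that" gets id 1 here, unlike the dictionary table in the question), and the names WordIndexSketch and BuildIndexCountLine are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordIndexSketch
{
    // Returns "id:count" pairs separated by spaces.
    // Assumption: ids are 1-based and handed out in order of first appearance.
    public static string BuildIndexCountLine(string text)
    {
        // word -> id, compared case-insensitively
        var ids = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        // counts[id - 1] holds the number of occurrences of the word with that id
        var counts = new List<int>();

        foreach (Match m in Regex.Matches(text, @"\w+"))
        {
            int id;
            if (!ids.TryGetValue(m.Value, out id))
            {
                id = ids.Count + 1;   // first time we see this word: next free id
                ids.Add(m.Value, id);
                counts.Add(0);
            }
            counts[id - 1]++;
        }

        return string.Join(" ", counts.Select((c, i) => (i + 1) + ":" + c));
    }

    public static void Main()
    {
        const string input =
            "that I have not that place sunrise beach like not good dirty beach trash beach";
        Console.WriteLine(BuildIndexCountLine(input));
        // prints: 1:2 2:1 3:1 4:2 5:1 6:1 7:3 8:1 9:1 10:1 11:1
    }
}
```

To write the line to a file instead of the console, File.WriteAllText from System.IO could be called with the returned string.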

Answer 1 (score: 0)

BETWEEN

Answer 2 (score: -1)

Use this code:

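    string input = "that I have not that place sunrise beach like not good dirty beach trash beach";
    // A null separator array splits on any whitespace; the cast keeps the overload choice unambiguous.
    var wordList = input.Split((char[])null);
    // Group identical words, count each group, and sort by ascending count.
    var output = wordList.GroupBy(x => x)
                         .Select(x => new Word { charchter = x.Key, repeat = x.Count() })
                         .OrderBy(x => x.repeat);
    foreach (var item in output)
    {
        textBoxfile.Text += item.charchter + " : " + item.repeat + Environment.NewLine;
    }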

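The class that holds the data:

    // Named Word (capitalized) so that it matches the "new Word { ... }" call above.
    public class Word
    {
        public string charchter { get; set; }   // the word itself
        public int repeat { get; set; }         // how many times it occurs
    }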
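The expected output in the question uses ids from an existing dictionary (for example "that" has id 4) and lists the pairs in order of first appearance. A rough sketch of that variant follows, assuming the id table is already available as a Dictionary; the names FixedDictionarySketch and Encode are hypothetical, and words are compared case-sensitively here for simplicity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FixedDictionarySketch
{
    // Emits "id:count" for each distinct word, in order of first appearance.
    // Assumption: every word in the text has an entry in ids;
    // a missing word makes the indexer throw KeyNotFoundException.
    public static string Encode(string text, IDictionary<string, int> ids)
    {
        return string.Join(" ",
            text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries) // split on whitespace
                .GroupBy(w => w)   // LINQ to Objects yields groups in first-appearance order
                .Select(g => ids[g.Key] + ":" + g.Count()));
    }

    public static void Main()
    {
        // The id table from the question's sample.
        var ids = new Dictionary<string, int>
        {
            { "I", 1 }, { "have", 2 }, { "not", 3 }, { "that", 4 }, { "place", 5 },
            { "sunrise", 6 }, { "beach", 7 }, { "like", 8 }, { "good", 9 },
            { "dirty", 10 }, { "trash", 11 }
        };
        const string input =
            "that I have not that place sunrise beach like not good dirty beach trash beach";

        Console.WriteLine(Encode(input, ids));
        // prints: 4:2 1:1 2:1 3:2 5:1 6:1 7:3 8:1 9:1 10:1 11:1
    }
}
```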