Storing line numbers and frequency for each word

Date: 2015-04-30 17:25:51

Tags: c# list dictionary frequency word

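I am working on a problem where I have to read a text file and, for each word, report how often it occurs and the line numbers on which it appears.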

For example, a txt file that reads

"Hi my name is

Bob. This is 

Cool"

should return:

1 Hi 1

1 my 1

1 name 1

2 is 1 2

1 bob 2

1 this 2

1 cool 3

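I can't figure out how to store the line numbers alongside the word frequency. I have tried a few different things, and this is where I am so far.

Any help?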

        Dictionary<string, int> countDictionary = new Dictionary<string,int>();
        Dictionary<string, List<int>> lineDictionary = new Dictionary<string, List<int>>();

        List<string> lines = new List<string>();


        System.IO.StreamReader file =
                new System.IO.StreamReader("Sample.txt");

        //Creates a List of lines
        string x;
        while ((x = file.ReadLine()) != null)
        {
            lines.Add(x);
        }

        foreach(var y in Enumerable.Range(0,lines.Count()))
        {
            foreach(var word in lines[y].Split())
            {
                if(!countDictionary.Keys.Contains(word.ToLower()) && !lineDictionary.Keys.Contains(word.ToLower()))
                {
                    countDictionary.Add(word.ToLower(), 1);
                    //lineDictionary.Add(word.ToLower(), /*what to put here*/);
                }
                else
                {
                    countDictionary[word] += 1;
                    //ADD line to dictionary???
                }
            }
        }



       foreach (var pair in countDictionary)//WHAT TO PUT HERE to print both 
       {
           Console.WriteLine("{0}  {1}", pair.Value, pair.Key);
       }

        file.Close();


        System.Console.ReadLine();

2 answers:

Answer 0 (score: 3)

You can do this with a single LINQ statement:

var processed =
  //get the lines of text as IEnumerable<string> 
  File.ReadLines(@"myFilePath.txt")
    //get a word and a line number for every word
    //so you'll have a sequence of objects with 2 properties
    //word and lineNumber
    .SelectMany((line, lineNumber) => line.Split().Select(word => new{word, lineNumber}))
    //group these objects by their "word" property
    .GroupBy(x => x.word)
    //select what you need
    .Select(g => new{
        //number of objects in the group
        //i.e. the frequency of the word
        Count = g.Count(), 
        //the actual word
        Word = g.Key, 
        //a sequence of line numbers of each instance of the 
        //word in the group
        Positions = g.Select(x => x.lineNumber)});

foreach(var entry in processed)
{
    Console.WriteLine("{0} {1} {2}",
                      entry.Count,
                      entry.Word,
                      string.Join(" ",entry.Positions));
}

I prefer zero-based counting, so the line numbers above start at 0; add 1 where appropriate if you want them to start at 1.
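
The expected output in the question uses 1-based line numbers and lower-cased words without punctuation ("bob" for "Bob."). Below is a minimal sketch of how the query above might be adapted. The `WordIndex` class, its `Build` method, and the set of trimmed punctuation characters are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordIndex
{
    // Hypothetical helper: returns (count, word, line numbers) for each
    // distinct word, in order of first appearance.
    public static List<Tuple<int, string, List<int>>> Build(IEnumerable<string> lines)
    {
        return lines
            // The second lambda parameter is the zero-based index of the line.
            .SelectMany((line, index) => line
                // A null separator array splits on any whitespace.
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                .Select(raw => new
                {
                    // Assumption: leading/trailing punctuation is not part of the word.
                    Word = raw.Trim('.', ',', ';', ':', '!', '?', '"').ToLowerInvariant(),
                    LineNumber = index + 1 // convert to 1-based
                }))
            // Drop tokens that consisted only of punctuation.
            .Where(x => x.Word.Length > 0)
            .GroupBy(x => x.Word)
            .Select(g => Tuple.Create(
                g.Count(),
                g.Key,
                g.Select(x => x.LineNumber).ToList()))
            .ToList();
    }
}
```

Calling `WordIndex.Build(File.ReadLines("Sample.txt"))` and printing each tuple with `string.Join(" ", entry.Item3)` should produce lines in the same shape as the question's sample output. Note that a word occurring twice on one line has that line number listed twice; append `.Distinct()` before `.ToList()` on the line numbers if that is not wanted.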

Answer 1 (score: 1)

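You are tracking two different properties of one entity, the "word", in two separate data structures. I would suggest creating a class that represents that entity, something like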

public class WordStats
{
    public string Word { get; set; }
    public int Count { get; set; }
    public List<int> AppearsInLines { get; set; }
    public WordStats()
    {
        AppearsInLines = new List<int>();
    }
}

Then keep track of the instances in a dictionary:

Dictionary<string, WordStats> wordStats = new Dictionary<string, WordStats>();

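Use the word itself as the key. Each time you encounter a word, check whether a WordStats instance with that key already exists. If it does, fetch it and update its Count and AppearsInLines properties; if it does not, create a new instance and add it to the dictionary.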

foreach(var y in Enumerable.Range(0,lines.Count()))
{
    foreach(var word in lines[y].Split())
    {
        WordStats wordStat;
        bool alreadyHave = wordStats.TryGetValue(word, out wordStat);
        if (alreadyHave)
        {
            wordStat.Count++;
            wordStat.AppearsInLines.Add(y);
        }
        else
        {
            wordStat = new WordStats();
            wordStat.Word = word;
            wordStat.Count = 1;
            wordStat.AppearsInLines.Add(y);
            wordStats.Add(word, wordStat);
        }