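Counting word occurrences in a text field using LINQ

Time: 2009-10-17 16:18:13

Tags: .net linq pattern-matching

How can I use LINQ to get the number of occurrences of a word in a database text field?

Keyword token example: ASP.NET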

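Edit 4:

Database records:

Record 1: [TextField] = "Blah blah blah ASP.NET bli bli bli ASP.NET blu ASP.NET yop yop ASP.NET"

Record 2: [TextField] = "Blah blah blah bli bli bli blu ASP.NET yop yop ASP.NET"

Record 3: [TextField] = "Blah ASP.NET blah ASP.NET blah ASP.NET bli ASP.NET bli bli ASP.NET blu ASP.NET yop yop ASP.NET"

So

Record 1 contains 4 occurrences of the "ASP.NET" keyword

Record 2 contains 2 occurrences of the "ASP.NET" keyword

Record 3 contains 7 occurrences of the "ASP.NET" keyword

Collection to extract: IList<RecordModel> (sorted by occurrence count, descending)

  • Record 3
  • Record 1
  • Record 2

LinqToSQL would be best, but LinqToObject is fine too :)

Note: the "." in the ASP.NET keyword is not the issue here (handling it is not the goal of this question).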

5 answers:

Answer 0: (score: 4)

Edit 2: I see you updated the question and changed things a bit. Word counts per word, eh? Try this:

string input = "some random text: how many times does each word appear in some random text, or not so random in this case";
char[] separators = new char[]{ ' ', ',', ':', ';', '?', '!', '\n', '\r', '\t' };

var query = from s in input.Split( separators )
            where s.Length > 0
            group s by s into g
            let count = g.Count()
            orderby count descending
            select new {
                Word = g.Key,
                Count = count
            };

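Because you want words that may contain a "." (for example "ASP.NET"), I have excluded it from the separator list. Unfortunately, that pollutes some words: a sentence like "Blah blah blah. Blah blah." will show "blah" with a count of 3 and "blah." with a count of 2. You need to think about which cleaning strategy you want, for example: if the "." has a letter on both sides it counts as part of the word, otherwise it is whitespace. That kind of logic is best done with a regular expression.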
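The cleaning rule described above could be sketched roughly like this. It is only an illustration, not the answerer's code: the WordFrequency class and its Count method are made-up names, and the pattern treats a "." as part of a word only when a letter or digit sits on both sides of it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class WordFrequency
{
    // A word is a run of letters/digits. A '.' is kept only when it is
    // followed by another run of letters/digits, so "ASP.NET" stays whole
    // while the trailing '.' of "blah." is dropped.
    private static readonly Regex WordPattern =
        new Regex(@"[\p{L}\p{Nd}]+(?:\.[\p{L}\p{Nd}]+)*");

    // Returns (word, count) pairs ordered by count, highest first.
    // Matching is case-sensitive, like the grouping query above.
    public static List<KeyValuePair<string, int>> Count(string input)
    {
        return WordPattern.Matches(input)
            .Cast<Match>()
            .GroupBy(m => m.Value)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)
            .ToList();
    }
}
```

Note that text with no space after a full stop (such as "blah.Brah") still comes out as one token under this rule, so it is a trade-off rather than a complete fix.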

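Answer 1: (score: 3)

A regular expression handles this nicely. You can use the \b metacharacter to anchor on word boundaries, and escape the keyword so that any special regex characters in it are not interpreted by accident. This also handles trailing periods, commas, and so on.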

string[] records =
{
    "foo ASP.NET bar", "foo bar",
    "foo ASP.NET? bar ASP.NET",
    "ASP.NET foo ASP.NET! bar ASP.NET",
    "ASP.NET, ASP.NET ASP.NET, ASP.NET"
};
string keyword = "ASP.NET";
string pattern = @"\b" + Regex.Escape(keyword) + @"\b";
var query = records.Select((t, i) => new
            {
                Index = i,
                Text = t,
                Count = Regex.Matches(t, pattern).Count
            })
            .OrderByDescending(item => item.Count);

foreach (var item in query)
{
    Console.WriteLine("Record {0}: {1} occurrences - {2}",
        item.Index, item.Count, item.Text);
}

Voilà! :)
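For the IList<RecordModel> the question asks for, the same idea might look like the sketch below. The question never shows RecordModel, so the Id and TextField properties here are assumptions, as are the KeywordRanking class and its method names. It also swaps \b for lookarounds, because \b cannot match next to a keyword that begins or ends with a non-word character (for example "C#"). The records are assumed to be in memory already (LINQ to Objects).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the question's RecordModel; its real shape is not shown.
public class RecordModel
{
    public int Id { get; set; }
    public string TextField { get; set; }
}

// Hypothetical helper names, for illustration only.
public static class KeywordRanking
{
    // Counts occurrences of keyword that are not directly preceded or
    // followed by a word character.
    public static int CountOccurrences(string text, string keyword)
    {
        string pattern = @"(?<!\w)" + Regex.Escape(keyword) + @"(?!\w)";
        return Regex.Matches(text, pattern).Count;
    }

    // Returns the records sorted by occurrence count, highest first.
    public static IList<RecordModel> OrderByKeywordCount(
        IEnumerable<RecordModel> records, string keyword)
    {
        return records
            .OrderByDescending(r => CountOccurrences(r.TextField, keyword))
            .ToList();
    }
}
```

With the three records from the question and the keyword "ASP.NET", this should give record 3, then record 1, then record 2.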

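Answer 2: (score: 1)

Use String.Split() to turn the string into an array of words, then use LINQ to filter that list down to the word you want, and check the count of the result, like this: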

myDbText.Split(' ').Where(token => token.Equals(word)).Count();
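Splitting on a single space misses tokens such as "ASP.NET," or "ASP.NET!". A slightly more tolerant variant, as a sketch only: the SplitCounter name, the separator list (borrowed from answer 0) and the case-insensitive comparison are all assumptions, not part of this answer.

```csharp
using System;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class SplitCounter
{
    // '.' is deliberately not a separator so that "ASP.NET" stays one token.
    private static readonly char[] Separators =
        { ' ', ',', ':', ';', '?', '!', '\n', '\r', '\t' };

    // Counts tokens equal to word, ignoring case and a trailing full stop.
    public static int Count(string text, string word)
    {
        return text
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Count(token => token.TrimEnd('.')
                .Equals(word, StringComparison.OrdinalIgnoreCase));
    }
}
```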

Answer 3: (score: 0)

You can use Regex.Matches(input, pattern).Count, or you can do the following:

int count = 0; int startIndex = input.IndexOf(word);
while (startIndex != -1) { ++count; startIndex = input.IndexOf(word, startIndex + 1); }

Using LINQ here would be ugly
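One way to keep the loop but still sort with LINQ is to wrap it in a small helper and call that from OrderByDescending. This is a sketch under assumptions: SubstringCounter and its method names are made up, and the ordinal comparison is a choice made here, not something the answer specifies. Like the loop above, it counts plain substrings (including overlapping ones), not whole words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class SubstringCounter
{
    // Counts occurrences of word in input with the IndexOf loop from above.
    public static int Count(string input, string word)
    {
        if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(word)) return 0;

        int count = 0;
        int index = input.IndexOf(word, StringComparison.Ordinal);
        while (index != -1)
        {
            count++;
            // Resume one character after the previous hit.
            index = input.IndexOf(word, index + 1, StringComparison.Ordinal);
        }
        return count;
    }

    // Sorts the texts by how often word occurs in each, highest first.
    public static List<string> Rank(IEnumerable<string> texts, string word)
    {
        return texts.OrderByDescending(t => Count(t, word)).ToList();
    }
}
```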

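Answer 4: (score: 0)

I know this goes beyond what the original question asked, but it is still on topic, and I am including it for others who search for this question later. It does not require whole-word matches in the searched strings, but it can easily be modified to do so using the code from Ahmad's post.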

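using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

//use this method to order objects and keep the existing type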
class Program
{
  static void Main(string[] args)
  {
    List<TwoFields> tfList = new List<TwoFields>();
    tfList.Add(new TwoFields { one = "foo ASP.NET barfoo bar", two = "bar" });
    tfList.Add(new TwoFields { one = "foo bar foo", two = "bar" });
    tfList.Add(new TwoFields { one = "", two = "barbarbarbarbar" });

    string keyword = "bar";
    string pattern = Regex.Escape(keyword);
    tfList = tfList.OrderByDescending(t => Regex.Matches(string.Format("{0}{1}", t.one, t.two), pattern).Count).ToList();

    foreach (TwoFields tf in tfList)
    {
      Console.WriteLine(string.Format("{0} : {1}", tf.one, tf.two));
    }

    Console.Read();
  }
}


//a class with two string fields to be searched on
public class TwoFields
{
  public string one { get; set; }
  public string two { get; set; }
}

//same as above, but uses multiple keywords
class Program
{
  static void Main(string[] args)
  {
    List<TwoFields> tfList = new List<TwoFields>();
    tfList.Add(new TwoFields { one = "one one, two; three four five", two = "bar" });
    tfList.Add(new TwoFields { one = "one one two three", two = "bar" });
    tfList.Add(new TwoFields { one = "one two three four five five", two = "bar" });

    string keywords = " five one    ";
    string keywordsClean = Regex.Replace(keywords, @"\s+", " ").Trim(); //replace multiple spaces with one space

    string pattern = Regex.Escape(keywordsClean).Replace("\\ ","|"); //escape special chars and replace spaces with "or"
    tfList = tfList.OrderByDescending(t => Regex.Matches(string.Format("{0}{1}", t.one, t.two), pattern).Count).ToList();

    foreach (TwoFields tf in tfList)
    {
      Console.WriteLine(string.Format("{0} : {1}", tf.one, tf.two));
    }

    Console.Read();
  }
}

public class TwoFields
{
  public string one { get; set; }
  public string two { get; set; }
}
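If whole-word matching is wanted for the multi-keyword version, the pattern could be built by splitting the keyword string and joining the escaped parts, then wrapping the alternation in \b as in Ahmad's answer. A rough sketch; MultiKeyword and BuildPattern are made-up names, and the \b anchors assume each keyword starts and ends with a word character.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class MultiKeyword
{
    // Turns " five one    " into the pattern \b(?:five|one)\b.
    public static string BuildPattern(string keywords)
    {
        string[] parts = keywords
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries) // null = split on whitespace
            .Select(k => Regex.Escape(k))
            .ToArray();
        return @"\b(?:" + string.Join("|", parts) + @")\b";
    }
}
```

The result can be passed to Regex.Matches in place of the pattern variable used above.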