How can I ignore 14 punctuation marks?

Time: 2018-12-12 20:15:00

Tags: c#

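I need to find identical words in a string. I use the Split method to break the string into words, but I get wrong results because "berhan" and "berhan," are treated as different words. I keep the punctuation marks in an array; how can I remove them? If a word X (other than a stop word) appears more than twice in the text, the computer should ask "Do you like X?". Assume the stop words are stored in the following array: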

  string[] stop_words = {"a", "after", "again", "all", "am", "and", "any", "are", "as", "at",
    "be", "been", "before", "between", "both", "but", "by", "can", "could", "for", "from",
    "had", "has", "he", "her", "here", "him", "in", "into", "I", "is", "it", "me", "my",
    "of", "on", "our", "she", "so", "such", "than", "that", "the", "then", "they", "this",
    "to", "until", "we", "was", "were", "with", "you"};

Example input:

Hello, I have a guitar, my guitar is blue

Output: Do you like guitar?

I use the Split method, but "guitar," does not equal "guitar".

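1 answer:

Answer 0 (score: 1)

I suggest extracting words instead of splitting (once you already have as many as 14 punctuation marks, there may well be a 15th, e.g. ՜ - U+055C Armenian exclamation mark); try using regular expressions for this: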

  using System;
  using System.Linq;
  using System.Text.RegularExpressions;

  ...

  string source = @"A lot of words: there're some Russian ones (русские слова).";

  string[] words = Regex
    .Matches(source, @"[\p{L}']+") // a word is composed of letters and apostrophes
    .OfType<Match>()
    .Select(match => match.Value)
    .ToArray();

  Console.Write(string.Join(Environment.NewLine, words)); 

Result:

A
lot
of
words
there're
some
Russian
ones
русские
слова

If you want to find the same (repeated) words, add grouping (GroupBy); to get rid of stop words, add filtering (Where):

  HashSet<string> stopWords = 
    new HashSet<string>(StringComparer.CurrentCultureIgnoreCase) {
      "is", "a", //TODO: put stopwords here 
  };

  string[] repeatedWords = Regex
    .Matches(source, @"[\p{L}']+") // a word is composed of letters and apostrophes
    .OfType<Match>()
    .Select(match => match.Value)
    .Where(word => !stopWords.Contains(word)) // not a stopword
    .GroupBy(word => word, StringComparer.CurrentCultureIgnoreCase)
    .Where(group => group.Count() > 2) // appeared more than 2 times
    .Select(group => group.Key)
    .ToArray();
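
To tie the query back to the question, here is a minimal sketch that wraps it in a method and prints the prompt for every repeated word. The names `RepeatedWordsDemo`, `FindRepeated` and `minCount` are hypothetical and chosen only for illustration. The threshold is a parameter because the question says "more than twice" while its example repeats the word only twice. The sketch also swaps in `StringComparer.OrdinalIgnoreCase` so that the result does not depend on the current culture, and the stop word list is abbreviated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RepeatedWordsDemo
{
    // Hypothetical helper: returns the words (stop words excluded)
    // that occur at least minCount times, ignoring case.
    public static string[] FindRepeated(string source, ISet<string> stopWords, int minCount)
    {
        return Regex
            .Matches(source, @"[\p{L}']+")             // runs of letters and apostrophes
            .OfType<Match>()
            .Select(match => match.Value)
            .Where(word => !stopWords.Contains(word))  // drop stop words
            .GroupBy(word => word, StringComparer.OrdinalIgnoreCase)
            .Where(group => group.Count() >= minCount)
            .Select(group => group.Key)
            .ToArray();
    }

    public static void Main()
    {
        // Abbreviated for the example; use the full list from the question.
        var stopWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "i", "is", "my"
        };

        string text = "Hello, I have a guitar, my guitar is blue";

        foreach (string word in FindRepeated(text, stopWords, 2))
            Console.WriteLine($"Do you like {word}?");
    }
}
```

With the question's sample sentence and a threshold of 2, the only word reported is guitar, because the comma after the first occurrence is never part of a match.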

Edit: how many punctuation marks do we actually have?

  int count = Enumerable
    .Range(0, char.MaxValue)
    .Count(c => char.IsPunctuation((char)c));

  Console.Write(count);

It may surprise you, but there are as many as 593 (not even close to 14).
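
The practical consequence can be seen in a small comparison. This is only a sketch: the class `SplitVersusExtract`, its two methods and the separator list are hypothetical stand-ins for a hand-written list of punctuation marks such as the one in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitVersusExtract
{
    // Hypothetical fixed separator list, standing in for the question's 14 marks.
    private static readonly char[] Separators = { ' ', ',', '.', '!', '?', ';', ':' };

    // Splitting: any punctuation mark missing from the list stays glued to the word.
    public static string[] BySplit(string text) =>
        text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);

    // Extracting: only letters and apostrophes are kept, whatever surrounds them.
    public static string[] ByRegex(string text) =>
        Regex.Matches(text, @"[\p{L}']+")
             .OfType<Match>()
             .Select(match => match.Value)
             .ToArray();

    public static void Main()
    {
        // \u055C is the Armenian exclamation mark, which is not in Separators.
        string text = "guitar\u055C guitar, guitar";

        Console.WriteLine(string.Join("|", BySplit(text)));  // first token still ends with \u055C
        Console.WriteLine(string.Join("|", ByRegex(text)));  // guitar|guitar|guitar
    }
}
```

Splitting leaves the unlisted mark attached to the first word, so it would not be counted together with the other two; extraction yields three identical words.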