Question

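I need to find repeated words in a string. I use the Split method to break the string into words, but I get wrong results because "berhan" is not the same as "berhan,". I keep the punctuation marks in an array; how can I remove them? If a word X (other than a stop word) appears more than two times in the text, the computer should ask "Do you like X?". Assume the stop words are stored in the following array: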

string[] stop_words = {"a", "after", "again", "all", "am", "and", "any", "are", "as", "at", "be", "been", "before", "between", "both", "but", "by", "can", "could", "for", "from", "had", "has", "he", "her", "here", "him", "in", "into", "I", "is", "it", "me", "my", "of", "on", "our", "she", "so", "such", "than", "that", "the", "then", "they", "this", "to", "until", "we", "was", "were", "with", "you"};

Example input:

Hello, I have a guitar, my guitar is blue

Output: Do you like guitar?

I used the Split method, but "guitar" is not equal to "guitar,".

Answer 1

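I suggest extracting the words rather than splitting (when you already have as many as 14 punctuation marks, there may well be a 15th, e.g. ՜ U+055C ARMENIAN EXCLAMATION MARK); try using a regular expression for this: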

  using System;
  using System.Linq;
  using System.Text.RegularExpressions;

  ...

  string source = @"A lot of words: there're some Russian ones (русские слова).";

  string[] words = Regex
    .Matches(source, @"[\p{L}']+") // a word consists of letters and apostrophes
    .OfType<Match>()
    .Select(match => match.Value)
    .ToArray();

  Console.Write(string.Join(Environment.NewLine, words));

Result:

A
lot
of
words
there're
some
Russian
ones
русские
слова

If you want to find the same (repeated) words, add grouping (GroupBy) and, to get rid of the stop words, filtering (Where):

  HashSet<string> stopWords = 
    new HashSet<string>(StringComparer.CurrentCultureIgnoreCase) {
      "is", "a", //TODO: put stopwords here 
  };

  string[] repeatedWords = Regex
    .Matches(source, @"[\p{L}']+") // a word consists of letters and apostrophes
    .OfType<Match>()
    .Select(match => match.Value)
    .Where(word => !stopWords.Contains(word)) // not a stopword
    .GroupBy(word => word, StringComparer.CurrentCultureIgnoreCase)
    .Where(group => group.Count() > 2) // appeared more than 2 times
    .Select(group => group.Key)
    .ToArray();
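
Putting the pieces together, here is a minimal sketch of the whole task as a small program. The class name RepeatedWordsDemo, the method FindRepeatedWords, its threshold parameter, the abbreviated stop-word list and the sample sentence are all assumptions made for illustration; the full array from the question would replace the short list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RepeatedWordsDemo
{
    // Abbreviated stop-word list (assumption); use the full array from the question in practice.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.CurrentCultureIgnoreCase)
        {
            "a", "and", "I", "is", "my", "the"
        };

    // Returns every non-stop word that occurs more than `threshold` times, ignoring case.
    public static string[] FindRepeatedWords(string source, int threshold)
    {
        return Regex
            .Matches(source, @"[\p{L}']+")              // extract runs of letters and apostrophes
            .OfType<Match>()
            .Select(match => match.Value)
            .Where(word => !StopWords.Contains(word))    // drop stop words
            .GroupBy(word => word, StringComparer.CurrentCultureIgnoreCase)
            .Where(group => group.Count() > threshold)   // keep words seen more than `threshold` times
            .Select(group => group.Key)
            .ToArray();
    }

    public static void Main()
    {
        // Hypothetical sample text: "guitar" occurs three times, next to different punctuation.
        string text = "Hello, I have a guitar. My guitar is blue, and the guitar is new.";

        foreach (string word in FindRepeatedWords(text, 2))
            Console.WriteLine($"Do you like {word}?");
    }
}
```

Because the words are extracted by the pattern instead of being cut at separators, "guitar." and "guitar," both come out as plain "guitar", so no separate list of punctuation marks is needed.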

Edit: how many punctuation marks do we actually have?

  int count = Enumerable
    .Range(0, char.MaxValue)
    .Count(c => char.IsPunctuation((char)c));

  Console.Write(count);

It may surprise you, but there are as many as 593 (not even close to 14).
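
To see where that total comes from, the same scan can be grouped by Unicode category. This is only a sketch: the class name PunctuationCensus and the method CountByCategory are made up for illustration, and the exact numbers depend on the Unicode tables shipped with the runtime, so none are quoted here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class PunctuationCensus
{
    // Counts the UTF-16 code units accepted by char.IsPunctuation, per Unicode category.
    public static Dictionary<UnicodeCategory, int> CountByCategory()
    {
        return Enumerable
            .Range(0, char.MaxValue + 1)                  // every UTF-16 code unit, U+0000..U+FFFF
            .Select(code => (char)code)
            .Where(c => char.IsPunctuation(c))
            .GroupBy(c => char.GetUnicodeCategory(c))
            .ToDictionary(group => group.Key, group => group.Count());
    }

    public static void Main()
    {
        Dictionary<UnicodeCategory, int> counts = CountByCategory();

        foreach (KeyValuePair<UnicodeCategory, int> pair in counts.OrderByDescending(p => p.Value))
            Console.WriteLine($"{pair.Key}: {pair.Value}");

        Console.WriteLine($"Total: {counts.Values.Sum()}");
    }
}
```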

How can I ignore 14 punctuation marks?