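I need to find identical words in a string. I use the Split method to break it into words, but I get wrong results because "berhan" and "berhan," are not the same. I keep the punctuation marks in an array; how can I remove them? If a word X (other than a stop word) occurs more than twice in the text, the computer should ask "Do you like X?". Assume the stop words are stored in the following array: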
string[] stop_words = {"a", "after", "again", "all", "am", "and", "any", "are", "as", "at", "be", "been", "before", "between", "both", "but", "by", "can", "could", "for", "from", "had", "has", "he", "her", "here", "him", "in", "into", "I", "is", "it", "me", "my", "of", "on", "our", "she", "so", "such", "than", "that", "the", "then", "they", "this", "to", "until", "we", "was", "were", "with", "you"};
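For example:
Input: Hello, I have a guitar, my guitar is blue
Output: Do you like guitar?
I use the Split method, but "guitar," is not equal to "guitar".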
Answer 0 (score: 1)
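I suggest extracting the words instead of splitting (once you already have as many as 14 punctuation marks, there may well be a 15th, e.g. ՜, U+055C, the Armenian exclamation mark); try using a regular expression for this: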
using System;
using System.Linq;
using System.Text.RegularExpressions;
...
string source = @"A lot of words: there're some Russian ones (русские слова).";

string[] words = Regex
    .Matches(source, @"[\p{L}']+") // a word consists of letters and apostrophes
    .OfType<Match>()
    .Select(match => match.Value)
    .ToArray();

Console.Write(string.Join(Environment.NewLine, words));
Outcome:
A
lot
of
words
there're
some
Russian
ones
русские
слова
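To see why extraction sidesteps the problem from the question, here is a minimal sketch that runs both approaches on the question's sample sentence. The class name `WordExtractionDemo` and its method names are made up for this illustration; only `String.Split` and `Regex.Matches` come from the framework.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordExtractionDemo
{
    // Splitting on spaces keeps punctuation glued to the neighbouring word.
    public static string[] SplitOnSpaces(string text) =>
        text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Extracting runs of letters and apostrophes never picks up punctuation.
    public static string[] ExtractWords(string text) =>
        Regex.Matches(text, @"[\p{L}']+")
             .OfType<Match>()
             .Select(match => match.Value)
             .ToArray();

    public static void Main()
    {
        string text = "Hello, I have a guitar, my guitar is blue";

        // Hello,|I|have|a|guitar,|my|guitar|is|blue
        Console.WriteLine(string.Join("|", SplitOnSpaces(text)));

        // Hello|I|have|a|guitar|my|guitar|is|blue
        Console.WriteLine(string.Join("|", ExtractWords(text)));
    }
}
```

With Split, the first "guitar," carries its comma and so never compares equal to the second "guitar"; with extraction both occurrences come out identical.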
If you want to find the identical (repeated) words, add grouping (GroupBy) and, to get rid of stop words, filtering (Where):
using System.Collections.Generic;
...
HashSet<string> stopWords =
    new HashSet<string>(StringComparer.CurrentCultureIgnoreCase) {
        "is", "a", //TODO: put stopwords here
    };

string[] repeatedWords = Regex
    .Matches(source, @"[\p{L}']+") // a word consists of letters and apostrophes
    .OfType<Match>()
    .Select(match => match.Value)
    .Where(word => !stopWords.Contains(word)) // not a stop word
    .GroupBy(word => word, StringComparer.CurrentCultureIgnoreCase)
    .Where(group => group.Count() > 2) // appeared more than 2 times
    .Select(group => group.Key)
    .ToArray();
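Putting the pieces together, the prompt from the question could be produced roughly as follows. This is only a sketch under a few assumptions: the class `RepeatedWordPrompt`, the method `FindRepeatedWords`, and its `moreThan` parameter are hypothetical names, the stop-word set is a shortened subset of the question's array, and the sample sentence is invented so that a word really occurs more than twice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RepeatedWordPrompt
{
    // Shortened subset of the question's stop words (assumption: use the full array in practice).
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.CurrentCultureIgnoreCase)
        {
            "a", "and", "I", "is", "my", "the"
        };

    // Returns every non-stop word that occurs more than `moreThan` times, ignoring case.
    public static IEnumerable<string> FindRepeatedWords(string text, int moreThan) =>
        Regex.Matches(text, @"[\p{L}']+")
             .OfType<Match>()
             .Select(match => match.Value)
             .Where(word => !StopWords.Contains(word))
             .GroupBy(word => word, StringComparer.CurrentCultureIgnoreCase)
             .Where(group => group.Count() > moreThan)
             .Select(group => group.Key);

    public static void Main()
    {
        string text = "My guitar is blue, the guitar is old, and I love my guitar.";

        foreach (string word in FindRepeatedWords(text, 2))
        {
            Console.WriteLine($"Do you like {word}?"); // prints: Do you like guitar?
        }
    }
}
```

The threshold is a parameter because the question's rule ("more than twice") and its sample (where "guitar" appears only twice) disagree; passing 1 instead of 2 would match the sample.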
Edit: how many punctuation marks do we actually have?
int count = Enumerable
    .Range(0, char.MaxValue)
    .Count(c => char.IsPunctuation((char)c));

Console.Write(count);
This may surprise you, but there are as many as 593 (nowhere near 14).