Loop through a string and remove any occurrences of specified words

Asked: 2017-05-15 19:52:54

Tags: c# arrays asp.net-mvc loops foreach

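I am trying to remove all conjunctions and pronouns from an array of strings (let's call it array A). The words to be removed are read from a text file and converted into an array of strings (let's call it array B).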

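What I need is to take the first element of array A and compare it with every word in array B. If the words match, I want to remove that word from array A, and then do the same for the remaining elements.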

For example:

Array A = [0]I [1]want [2]to [3]go [4]home [5]and [6]sleep
Array B = [0]I [1]and [2]go [3]to

Result = array A = [0]want [1]home [2]sleep

//remove any duplicates,conjunctions and Pronouns
        public IQueryable<All_Articles> removeConjunctionsProNouns(IQueryable<All_Articles> myArticles)
        {
            //get words to be removed
            string text = System.IO.File.ReadAllText("A:\\EnterpriceAssigment\\EnterpriceAssigment\\TextFiles\\conjunctions&ProNouns.txt").ToLower();
            //split word into array of strings 
            string[] wordsToBeRemoved = text.Split(',');
            //all articles
            foreach (var article in myArticles)
            {
               //split articles into words
                string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
                //loop through array of articles words
                foreach (var y in articleSplit)
                {
                    //loop through words to be removed from articleSplit
                    foreach (var x in wordsToBeRemoved)
                    {
                        //if word of articles matches word to be removed, remove word from article
                        if (y == x)
                        {
                            //get index of element in array to be removed
                            int g = Array.IndexOf(articleSplit,y);
                            //assign element to ""
                            articleSplit[g] = "";
                        }
                    }
                }
                //re-assign the split article to a string
                article.ArticleContent = articleSplit.ToString();
            }
            return myArticles;
        }

If possible, I also need array A to contain no duplicates, only distinct values.

3 Answers:

Answer 0: (score: 2)

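You are looking for IEnumerable.Except (the LINQ extension method Enumerable.Except): the sequence you pass as the argument is excluded from the input sequence, so only the elements of the input sequence that are not present in the argument are returned. Because it is a set operation, each remaining element is returned only once.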

For example:

string inputText = "I want this string to be returned without some words , but words should have only one occurence";
string[] excludedWords = new string[] {"I","to","be", "some", "but", "should", "have", "one", ","};

var splitted = inputText.Split(' ');
var result = splitted.Except(excludedWords);
foreach(string s in result)
    Console.WriteLine(s);

// ---- Output ----
want
this
string
returned
without
words   <<-- This appears only once
only
occurence

Applied to your code, this becomes:

string text = File.ReadAllText(......).ToLower();
string[] wordsToBeRemoved = text.Split(',');
foreach (var article in myArticles)
{
    string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
    var result = articleSplit.Except(wordsToBeRemoved);
    article.ArticleContent = string.Join(" ", result);
}
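
One detail the snippet above leaves open is that splitting the file text on ',' can leave spaces or line breaks attached to the entries, and that Except compares case-sensitively by default. The following is only a minimal sketch under those assumptions: the class StopWordFilter and the methods ParseWordList and RemoveWords are hypothetical names, not part of the original code, and the word list is passed in as a string instead of being read from the file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordFilter
{
    // Hypothetical helper: turns "I, and ,go,to\n" into { "I", "and", "go", "to" }.
    public static string[] ParseWordList(string commaSeparated)
    {
        return commaSeparated
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim())          // drop spaces and line breaks around each entry
            .Where(w => w.Length > 0)
            .ToArray();
    }

    // Removes the listed words; Except is a set operation, so duplicates are dropped too.
    public static string RemoveWords(string content, IEnumerable<string> wordsToBeRemoved)
    {
        var words = content.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        // The comparer makes the match case-insensitive, so ToLower() is not needed.
        var kept = words.Except(wordsToBeRemoved, StringComparer.OrdinalIgnoreCase);
        return string.Join(" ", kept);
    }
}
```

With this sketch, StopWordFilter.RemoveWords("I want to go home and sleep", StopWordFilter.ParseWordList("I, and ,go,to")) should give "want home sleep", and the original casing of the article is kept.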

Answer 1: (score: 0)

You want to remove stop words. You can do this with the help of Linq:

  ...
  string filePath = @"A:\EnterpriceAssigment\EnterpriceAssigment\TextFiles\conjunctions&ProNouns.txt";

  // HashSet is much more efficient than an array in this context
  HashSet<string> stopWords = new HashSet<string>(File
    .ReadLines(filePath), StringComparer.OrdinalIgnoreCase);

  foreach (var article in myArticles) {
    // read article, split into words, filter out stop words... 
    var cleared = article
      .ArticleContent
      .Split(' ')
      .Where(word => !stopWords.Contains(word));

    // ...and join words back into article
    article.ArticleContent = string.Join(" ", cleared);  
  }
  ...

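Please note that I have kept the Split() you use in your code, so what you have is a toy implementation. In real life you have to take at least punctuation into account, which is why better code uses regular expressions: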

  foreach (var article in myArticles) {
    // read article, extract words, filter out stop words... 
    var cleared = Regex
      .Matches(article.ArticleContent, @"\w+") // <- extract words
      .OfType<Match>()
      .Select(match => match.Value)
      .Where(word => !stopWords.Contains(word));

    // ...and join words back into article
    article.ArticleContent = string.Join(" ", cleared);  
  }
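
The question also asks for distinct values, which the Where filter alone does not provide. As a rough sketch only, the regex version can be wrapped in a method and extended with Distinct; the class ArticleCleaner and the method Clean are hypothetical names introduced for illustration, and the stop words are assumed to be supplied by the caller in a case-insensitive set.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ArticleCleaner
{
    // Extracts words with a regex, drops stop words, then drops repeated words.
    public static string Clean(string content, ISet<string> stopWords)
    {
        var cleared = Regex
            .Matches(content, @"\w+")                       // extract words, ignoring punctuation
            .Cast<Match>()
            .Select(match => match.Value)
            .Where(word => !stopWords.Contains(word))       // filter out stop words
            .Distinct(StringComparer.OrdinalIgnoreCase);    // keep one copy of each word

        return string.Join(" ", cleared);
    }
}
```

A possible call, assuming a set built with StringComparer.OrdinalIgnoreCase that contains "I", "and", "go" and "to": ArticleCleaner.Clean("I want to go home, and I want to sleep.", stopWords) should produce "want home sleep". Note that punctuation is lost, because only the \w+ matches are joined back together.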

Answer 2: (score: 0)

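You may already have the answer in your code. I'm sure your code could be cleaned up a bit, as all of our code could. You loop through articleSplit and pull out each word. You then compare that word, one by one, against the words in the wordsToBeRemoved array. You use a condition for the comparison and, when it is true, you remove the item from the original array, or at least try to.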

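I would create another array for the results and then display or use that array, which is the original minus the words to exclude:

    foreach x in articleSplit
        matched = false
        foreach y in wordsToBeRemoved
            if x == y then matched = true
        if not matched then newArray.Add(x)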
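
The loop described above could look roughly like this in C#. This is only a sketch: the class LoopFilter and the method RemoveWords are hypothetical names, a List<string> stands in for the result array because arrays have no Add method, and the duplicate check is an assumption added to cover the question's request for distinct values.

```csharp
using System;
using System.Collections.Generic;

public static class LoopFilter
{
    // Copies every word that is not in wordsToBeRemoved into a new list.
    public static string[] RemoveWords(string[] articleSplit, string[] wordsToBeRemoved)
    {
        var newArray = new List<string>();

        foreach (var x in articleSplit)
        {
            bool matched = false;
            foreach (var y in wordsToBeRemoved)
            {
                if (string.Equals(x, y, StringComparison.OrdinalIgnoreCase))
                {
                    matched = true;   // x is a word to be removed
                    break;
                }
            }

            // Add x only after it has been checked against every excluded word,
            // and only if it is not already in the result (case-sensitive check).
            if (!matched && !newArray.Contains(x))
            {
                newArray.Add(x);
            }
        }

        return newArray.ToArray();
    }
}
```

The important point is that the word is added once, after the inner loop has finished, not inside it; adding it inside the inner loop would insert it once for every excluded word that does not match.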

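However, that is quite a lot of work. You may want to use something like array.filter and build the result that way. There are a hundred ways to achieve this.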

Here are some useful articles:

filter an array in C#
https://msdn.microsoft.com/en-us/library/d9hy2xwa(v=vs.110).aspx

These will save you from all the looping.
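
C# arrays do not have a method that is literally called filter; the closest built-in equivalents are Array.FindAll and LINQ's Where. A minimal sketch with Array.FindAll, assuming both arrays are already lower-cased as in the question (FindAllFilter and RemoveWords are hypothetical names used only for this example):

```csharp
using System;

public static class FindAllFilter
{
    // Keeps only the words that do not occur in wordsToBeRemoved (case-sensitive).
    public static string[] RemoveWords(string[] articleSplit, string[] wordsToBeRemoved)
    {
        return Array.FindAll(
            articleSplit,
            word => Array.IndexOf(wordsToBeRemoved, word) < 0);
    }
}
```

Unlike Except, this keeps duplicates, so a word that appears twice in the article appears twice in the result.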