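I am trying to remove all conjunctions and pronouns from an arbitrary array of strings (call it array A). The words to remove are read from a text file and converted into an array of strings (call it array B).
What I need is to take each element of array A in turn and compare it with every word in array B; if the words match, I want to remove that word from array A.
For example:
Array A = [0] I [1] want [2] to [3] go [4] home [5] and [6] sleep
Array B = [0] I [1] and [2] go [3] to
Result = array A = [0] want [1] home [2] sleep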
//remove any duplicates, conjunctions and pronouns
public IQueryable<All_Articles> removeConjunctionsProNouns(IQueryable<All_Articles> myArticles)
{
    //get words to be removed
    string text = System.IO.File.ReadAllText("A:\\EnterpriceAssigment\\EnterpriceAssigment\\TextFiles\\conjunctions&ProNouns.txt").ToLower();
    //split words into array of strings
    string[] wordsToBeRemoved = text.Split(',');
    //all articles
    foreach (var article in myArticles)
    {
        //split article into words
        string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
        //loop through array of article words
        foreach (var y in articleSplit)
        {
            //loop through words to be removed from articleSplit
            foreach (var x in wordsToBeRemoved)
            {
                //if word of article matches word to be removed, remove word from article
                if (y == x)
                {
                    //get index of element in array to be removed
                    int g = Array.IndexOf(articleSplit,y);
                    //assign element to ""
                    articleSplit[g] = "";
                }
            }
        }
        //re-assign split article to string
        article.ArticleContent = articleSplit.ToString();
    }
    return myArticles;
}
If possible, I would also like array A to end up without duplicates, i.e. with distinct values only.
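Answer 0 (score: 2)
You are looking for Enumerable.Except (an extension method on IEnumerable<T>). It is called on the input sequence and takes a second sequence as its argument; it returns only those elements of the input that are not present in the argument. Because it is a set operation, each element is returned only once.
For example: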
string inputText = "I want this string to be returned without some words , but words should have only one occurrence";
string[] excludedWords = new string[] {"I","to","be", "some", "but", "should", "have", "one", ","};
var splitted = inputText.Split(' ');
var result = splitted.Except(excludedWords);
foreach(string s in result)
    Console.WriteLine(s);
// ---- Output ----
// want
// this
// string
// returned
// without
// words <<-- This appears only once
// only
// occurrence
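One thing to keep in mind about the example above: Except compares strings with the default equality comparer, which is case-sensitive, so a lowercase "i" would not be removed by "I". Here is a minimal sketch, assuming you want case-insensitive matching, that uses the overload taking an IEqualityComparer<string>. ExceptSketch and RemoveIgnoringCase are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExceptSketch
{
    // Hypothetical helper: returns the distinct words of `words`
    // that do not occur in `excluded`, ignoring case.
    public static string[] RemoveIgnoringCase(string[] words, string[] excluded)
    {
        return words
            .Except(excluded, StringComparer.OrdinalIgnoreCase)
            .ToArray();
    }
}
```

With this, you would not need to call ToLower() on the article just to make the comparison work, so the original casing of the remaining words is kept.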
Applied to your code, that becomes:
string text = File.ReadAllText(......).ToLower();
string[] wordsToBeRemoved = text.Split(',');
foreach (var article in myArticles)
{
    string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
    var result = articleSplit.Except(wordsToBeRemoved);
    article.ArticleContent = string.Join(" ", result);
}
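For what it's worth, the loop in the question also fails for a reason unrelated to the comparison: calling ToString() on a string[] returns the type name ("System.String[]") rather than the joined words, which is why string.Join is used above. Below is a minimal sketch of the same step as a self-contained method. It assumes the stop words arrive as one comma-separated string that may contain whitespace around the commas; the Trim call and the RemoveEmptyEntries option are my additions and not something the question confirms. ArticleCleaner, CleanArticle and its parameter names are hypothetical.

```csharp
using System;
using System.Linq;

public static class ArticleCleaner
{
    // Hypothetical helper: lower-cases the article, drops the stop words,
    // and returns the remaining distinct words joined by single spaces.
    public static string CleanArticle(string articleContent, string stopWordsCsv)
    {
        // Assumption: words are comma-separated, possibly padded with spaces.
        string[] wordsToBeRemoved = stopWordsCsv
            .ToLower()
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim())
            .ToArray();

        // RemoveEmptyEntries avoids empty "words" caused by double spaces.
        string[] articleSplit = articleContent
            .ToLower()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Except removes the stop words and also de-duplicates the result.
        return string.Join(" ", articleSplit.Except(wordsToBeRemoved));
    }
}
```

Inside the foreach you would then write something like article.ArticleContent = ArticleCleaner.CleanArticle(article.ArticleContent, text);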
Answer 1 (score: 0)
You want to remove stop words. You can do this with the help of Linq:
...
string filePath = @"A:\EnterpriceAssigment\EnterpriceAssigment\TextFiles\conjunctions&ProNouns.txt";
// HashSet is much more efficient than an array in this context.
// Note: ReadLines expects one stop word per line; the file in the
// question is comma-separated, so it would need Split(',') instead.
HashSet<string> stopWords = new HashSet<string>(File
    .ReadLines(filePath), StringComparer.OrdinalIgnoreCase);
foreach (var article in myArticles) {
    // read article, split into words, filter out stop words...
    var cleared = article
        .ArticleContent
        .Split(' ')
        .Where(word => !stopWords.Contains(word));
    // ...and join words back into article
    article.ArticleContent = string.Join(" ", cleared);
}
...
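Please note that I have kept the Split() you use in your code, so what you have is a toy implementation. In real life you have to deal with punctuation at the very least, which is why better code uses regular expressions: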
foreach (var article in myArticles) {
    // read article, extract words, filter out stop words...
    var cleared = Regex
        .Matches(article.ArticleContent, @"\w+") // <- extract words
        .OfType<Match>()
        .Select(match => match.Value)
        .Where(word => !stopWords.Contains(word));
    // ...and join words back into article
    article.ArticleContent = string.Join(" ", cleared);
}
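Note that Where, unlike Except, keeps repeated words. If the article should also end up with distinct words only, as the question asks, one option is to append Distinct to the query. The following is a sketch under that assumption; StopWordFilter and Clean are hypothetical names, and the stop words are passed in directly rather than read from the file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopWordFilter
{
    // Hypothetical helper: extracts words with a regex, drops stop words
    // (ignoring case) and keeps only the first occurrence of each word.
    public static string Clean(string content, IEnumerable<string> stopWords)
    {
        var stop = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);

        var cleared = Regex
            .Matches(content, @"\w+")                     // extract words
            .Cast<Match>()
            .Select(match => match.Value)
            .Where(word => !stop.Contains(word))          // drop stop words
            .Distinct(StringComparer.OrdinalIgnoreCase);  // drop repeats

        return string.Join(" ", cleared);
    }
}
```

Be aware that with the regex approach the punctuation is not written back: the result contains only the extracted words separated by single spaces.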
Answer 2 (score: 0)
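You probably already have the answer in your code. I'm sure your code could be cleaned up a bit, as all of ours could. You loop through articleSplit and pull out each word, then compare that word one by one against the words in the wordsToBeRemoved array. You use a condition for the comparison, and when it is true you remove the item from the original array, or at least try to.
I would create a separate array for the result, and then display, use, or do whatever you want with that array, which is the original minus the words to exclude:
    loop through articleSplit
    foreach x in articleSplit
        if no y in wordsToBeRemoved equals x
            newArray.Add(x)
However, that is quite a lot of work. You may prefer to filter the array and build the result that way. There are a hundred ways to accomplish this.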
Here are some helpful articles: filter an array in C# https://msdn.microsoft.com/en-us/library/d9hy2xwa(v=vs.110).aspx These will save you from all the looping.
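To make the loop described above concrete, here is a minimal sketch. It assumes a word is kept only when it matches none of the words to remove. C# arrays have no method literally named filter; Array.FindAll (or Enumerable.Where) plays that role. FilterSketch, WithLoops and WithFindAll are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class FilterSketch
{
    // Explicit loops: copy a word to the result only if it
    // matches none of the words to be removed.
    public static List<string> WithLoops(string[] articleSplit, string[] wordsToBeRemoved)
    {
        var newArray = new List<string>();
        foreach (string x in articleSplit)
        {
            bool excluded = false;
            foreach (string y in wordsToBeRemoved)
            {
                if (x == y)
                {
                    excluded = true;
                    break;
                }
            }
            if (!excluded)
            {
                newArray.Add(x);
            }
        }
        return newArray;
    }

    // Same result without hand-written loops: Array.FindAll keeps the
    // elements for which the predicate returns true.
    public static string[] WithFindAll(string[] articleSplit, string[] wordsToBeRemoved)
    {
        return Array.FindAll(articleSplit, x => Array.IndexOf(wordsToBeRemoved, x) < 0);
    }
}
```

Both versions keep repeated words and compare case-sensitively, so the input would still need ToLower() and a de-duplication step if those matter.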