Question

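I read two text files: the first contains Arabic text, which I split into words; the second contains stop words. I want to remove from the first file every stop word that appears in the second file, but I don't know how to do this: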

FileStream fs = new FileStream(@"H:\arabictext.txt", FileMode.Open);
StreamReader arab = new StreamReader(fs,Encoding.Default,true);
string artx = arab.ReadToEnd();
richTextBox1.Text = artx;
arab.Close();
char[] dele = {' ', ',', '.', '\t', ';','#','!' };

string[] words = richTextBox1.Text.Split(dele);

FileStream fsw = new FileStream("H:\\arab.txt", FileMode.Create);
StreamWriter arabw = new StreamWriter(fsw,Encoding.Default);

foreach (string s in words)
{
    arabw.WriteLine(s);
}
// Close the writer so the buffered output is flushed to disk
arabw.Close();

Answer 1

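If I understand correctly, you want to take the stop words from the first file and remove them from the second file.

Here is how I would solve it:

Extract the stop words from the first file
Iterate over the words extracted from the first file and replace each one with String.Empty in the contents of the second file
Save the file

I simplified your code to the following: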

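        // requires: using System; using System.Linq;
        // read file contents
        var fileContent1 = System.IO.File.ReadAllText("file1.txt");
        var fileContent2 = System.IO.File.ReadAllText("file2.txt");

        // extract stop-words from first file, skipping the empty entries
        // that Split produces between adjacent delimiters
        var words = fileContent1.Split(new char[] { ' ', ',', '.', '\t', ';', '#', '!' },
                                       StringSplitOptions.RemoveEmptyEntries)
                                .Distinct();

        // remove stop words in file2; strings are immutable, so the result
        // of Replace must be assigned back
        foreach (var word in words)
            fileContent2 = fileContent2.Replace(word, string.Empty);

        System.IO.File.WriteAllText("file2.txt", fileContent2);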
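
One caveat with substring replacement: `Replace` also deletes a stop word when it occurs inside a longer word. A minimal sketch of a whole-word alternative follows. It assumes both inputs use the same delimiters as the question (plus line breaks); the `StopWordFilter` class and its `RemoveStopWords` method are hypothetical names introduced here for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordFilter
{
    // Same delimiters as in the question, plus line breaks.
    private static readonly char[] Delimiters =
        { ' ', ',', '.', '\t', ';', '#', '!', '\r', '\n' };

    // Hypothetical helper: returns the words of `text` that do not appear
    // in `stopWordText`, comparing whole words only.
    public static List<string> RemoveStopWords(string text, string stopWordText)
    {
        // A HashSet gives constant-time lookups for each candidate word.
        var stopWords = new HashSet<string>(
            stopWordText.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries));

        return text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries)
                   .Where(word => !stopWords.Contains(word))
                   .ToList();
    }
}
```

It could be called with the contents of the two files, for example `StopWordFilter.RemoveStopWords(File.ReadAllText(textPath), File.ReadAllText(stopWordPath))`, and the returned list written out with `File.WriteAllLines`. The comparison is exact, so a word that carries diacritics will not match an undecorated stop word.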

Answer 2

I found a solution to my problem. Do you have a better one?

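        // requires: using System; using System.IO; using System.Text;
        // Delimiters used to split the text into words; line breaks are included
        // so that words on adjacent lines are not merged.
        char[] dele = { ' ', ',', '.', '\t', ';', '#', '!', '\r', '\n' };
        // Stop words and diacritic marks to strip (built once, outside the loop)
        string[] stopWord = new string[] { "قد", "في", "بيت", "فواصل", "هي", "من","$","ُ","ِ","ُ","ّ","ٍ","ٌ","ْ","ً" };

        // Pass 1: remove the stop words line by line and write the result to output.txt.
        // The writer uses the same encoding as the readers so the text round-trips.
        using (TextWriter tw = new StreamWriter(@"H:\output.txt", false, Encoding.Default))
        using (StreamReader reader = new StreamReader("H:\\arabictext.txt", Encoding.Default, true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                foreach (string word in stopWord)
                {
                    line = line.Replace(word, "");
                }
                // WriteLine keeps the original line breaks
                tw.WriteLine(line);
            }
        }

        // Pass 2: read the cleaned text back and write one word per line to result.txt
        string artx;
        using (StreamReader arab = new StreamReader(@"H:\output.txt", Encoding.Default, true))
        {
            artx = arab.ReadToEnd();
        }
        string[] words = artx.Split(dele, StringSplitOptions.RemoveEmptyEntries);

        using (StreamWriter arabw = new StreamWriter("H:\\result.txt", false, Encoding.Default))
        {
            foreach (string s in words)
            {
                arabw.WriteLine(s);
            }
        }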
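
The code above strips diacritics by listing each mark by hand. As a possible alternative, here is a minimal sketch that drops every character whose Unicode category is `NonSpacingMark`, which covers the Arabic short-vowel and shadda marks. The `DiacriticStripper` class and its method name are hypothetical, and this is broader than the hand-written list: it removes all combining marks, not only the ones listed.

```csharp
using System.Globalization;
using System.Text;

public static class DiacriticStripper
{
    // Hypothetical helper: returns `input` without non-spacing combining marks.
    public static string RemoveNonSpacingMarks(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Keep every character that is not a non-spacing combining mark.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Applied to each line before the stop-word check, this would let words with and without diacritics compare equal.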

Removing stop words from a text file in C#
