Removing stop words from a text file

Time: 2013-03-14 18:19:43

Tags: c# regex c#-4.0

I want to remove stop words from my text file, and I wrote the following code for this purpose:

 TextWriter tw = new StreamWriter("D:\\output.txt");
 private void button1_Click(object sender, EventArgs e)
        {
            StreamReader reader = new StreamReader("D:\\input1.txt");
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] parts = line.Split(' ');
                string[] stopWord = new string[] { "is", "are", "am","could","will" };
                foreach (string word in stopWord)
                {
                    line = line.Replace(word, "");
                    tw.Write("+"+line);
                }
                tw.Write("\r\n");
            } 

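But it does not produce any result in the output file; the output file stays empty.

4 answers:

Answer 0: (score: 6)

A regular expression is probably a good fit for this job: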

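        // Verbatim string (@) so that \b reaches the regex engine as a word boundary, not a backspace character.
        Regex replacer = new Regex(@"\b(?:is|are|am|could|will)\b");
        using (TextWriter writer = new StreamWriter("C:\\output.txt"))
        {
            using (StreamReader reader = new StreamReader("C:\\input.txt"))
            {
                while (!reader.EndOfStream)
                {
                    string line = reader.ReadLine();
                    // Replace returns a new string, so the result must be assigned back.
                    line = replacer.Replace(line, "");
                    writer.WriteLine(line);
                }
            }
            writer.Flush();
        }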

This approach replaces only whole stop words with an empty string; a stop word that is part of another word is left untouched.

Good luck.

Answer 1: (score: 2)
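To make the whole-word behaviour concrete, here is a minimal sketch that compares a word-boundary regex with plain string.Replace. The class name StopWordDemo, its helper methods, and the sample sentence are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

class StopWordDemo
{
    // Removes only standalone stop words; \b anchors the match at word boundaries.
    public static string RemoveWholeWords(string line)
    {
        return Regex.Replace(line, @"\b(?:is|are|am|could|will)\b", "");
    }

    // Plain substring replacement: also hits "is" inside longer words.
    public static string RemoveSubstrings(string line)
    {
        return line.Replace("is", "");
    }

    static void Main()
    {
        // Hypothetical sample input, not taken from the question.
        string line = "this island is small";

        Console.WriteLine("[" + RemoveWholeWords(line) + "]"); // [this island  small]
        Console.WriteLine("[" + RemoveSubstrings(line) + "]"); // [th land  small]
    }
}
```

Note that both variants leave a double space where the word was removed; collapsing that whitespace is a separate step.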

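The following works as expected. However, it is not a good approach, because it removes stop words even when they are part of a larger word. It also does not clean up the extra spaces left behind by the removed words.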

string[] stopWord = new string[] { "is", "are", "am","could","will" };

TextWriter writer = new StreamWriter("C:\\output.txt");
StreamReader reader = new StreamReader("C:\\input.txt");

string line;
while ((line = reader.ReadLine()) != null)
{
    foreach (string word in stopWord)
    {
        line = line.Replace(word, "");
    }
    writer.WriteLine(line);
}
reader.Close();
writer.Close();

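In addition, I recommend creating the streams inside using statements, so that the files are closed promptly.

Answer 2: (score: 1)

You should wrap the IO objects in using statements so that they are disposed of properly.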
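Both drawbacks mentioned in this answer (partial-word removal and leftover spaces) can be avoided by working on tokens instead of substrings. The following is only a sketch under a few assumptions: words are separated by spaces, matching is case-insensitive, and punctuation attached to a word (for example "is,") is not handled. The class name StopWordFilter and the method RemoveStopWords are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class StopWordFilter
{
    // Same stop-word list as in the question; the case-insensitive comparer is an added assumption.
    static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "is", "are", "am", "could", "will" },
        StringComparer.OrdinalIgnoreCase);

    public static string RemoveStopWords(string line)
    {
        // Split on spaces and drop empty entries, so runs of spaces collapse.
        string[] words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Keep only tokens that are not stop words, then rejoin with single spaces.
        return string.Join(" ", words.Where(w => !StopWords.Contains(w)));
    }

    static void Main()
    {
        // Hypothetical sample input.
        Console.WriteLine(RemoveStopWords("this island is small and we are happy"));
        // prints: this island small and we happy
    }
}
```

Because whole tokens are compared, "this" and "island" survive, and the rejoined line has single spaces only.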

using (TextWriter tw = new StreamWriter("D:\\output.txt"))
{
    using (StreamReader reader = new StreamReader("D:\\input1.txt"))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split(' ');
            string[] stopWord = new string[] { "is", "are", "am","could","will" };
            foreach (string word in stopWord)
            {
                line = line.Replace(word, "");
                tw.Write("+"+line);
            }
        }
    }
}

Answer 3: (score: 0)

Try wrapping the StreamWriter and the StreamReader in using() {} clauses:

using (TextWriter tw = new StreamWriter(@"D:\output.txt"))
{
  ...
}

You may also want to call tw.Flush() at the end.
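A likely reason the output file stayed empty in the question is that StreamWriter buffers its output, and the writer there is never flushed, closed, or disposed. The sketch below shows that leaving a using block is enough to get the text onto disk. The class name FlushDemo, the method WriteAndReadBack, and the temp-file name are assumptions made for illustration.

```csharp
using System;
using System.IO;

class FlushDemo
{
    public static string WriteAndReadBack(string path)
    {
        // Leaving the using block disposes the writer, which flushes the buffer and closes the file.
        using (TextWriter tw = new StreamWriter(path))
        {
            tw.WriteLine("hello");
        }

        // The file is closed at this point, so it can be reopened for reading.
        return File.ReadAllText(path);
    }

    static void Main()
    {
        // Hypothetical temp file, used instead of the D:\ paths from the question.
        string path = Path.Combine(Path.GetTempPath(), "flush-demo.txt");
        Console.WriteLine(WriteAndReadBack(path).Trim()); // hello
    }
}
```

With the using block in place, an explicit tw.Flush() at the end is harmless but not required.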