Removing stop words from a file: checking repeatedly duplicates the content and does not remove the words

Asked: 2018-04-25 19:14:37

Tags: java file-io stop-words

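I'm trying to go over a bunch of files, read each of them, and remove every stop word that appears in a given list. The result is a disaster: the content of the whole file gets copied over and over again.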

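What I've tried:
- Saving the file as a String and trying to find the words with a regex
- Saving the file as a String, going over it line by line, and comparing the tokens to the stop words stored in a LinkedHashSet (I could also keep them in a file)
- Twisting the logic below in multiple ways, getting more and more ridiculous output
- Looking into the text/line with the .contains() method, but no luck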

My general logic is as follows:

for every word in the stopwords set:
    while(file has more lines):
        save current line into String
        while (current line has more tokens):
            assign current token into String
            compare token with current stopword:
                if(token equals stopword):
                     write in the output file "" + " " 
                else: write in the output file the token as is

I tried what's in this question, as well as many other SO questions, but couldn't achieve what I need.

The real code is below:

private static void removeStopWords(File fileIn) throws IOException {
        File stopWordsTXT = new File("stopwords.txt");
        System.out.println("[Removing StopWords...] FILE: " + fileIn.getName() + "\n");

        // create file reader and go over it to save the stopwords into the Set data structure
        BufferedReader readerSW = new BufferedReader(new FileReader(stopWordsTXT));
        Set<String> stopWords = new LinkedHashSet<String>();

        for (String line; (line = readerSW.readLine()) != null; readerSW.readLine()) {
            // trim() eliminates leading and trailing spaces
            stopWords.add(line.trim());
        }           

        File outp = new File(fileIn.getPath().substring(0, fileIn.getPath().lastIndexOf('.')) + "_NoStopWords.txt");
        FileWriter fOut = new FileWriter(outp);

        Scanner readerTxt = new Scanner(new FileInputStream(fileIn), "UTF-8");
        while(readerTxt.hasNextLine()) {
            String line = readerTxt.nextLine();
            System.out.println(line);
            Scanner lineReader = new Scanner(line);

            for (String curSW : stopWords) {
                while(lineReader.hasNext()) {
                    String token = lineReader.next();
                    if(token.equals(curSW)) {
                        System.out.println("---> Removing SW: " + curSW);
                        fOut.write("" + " ");
                    } else {
                        fOut.write(token + " ");
                    }
                }
            }
            fOut.write("\n");
        }       
        fOut.close();
}

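Most commonly, it looks for the first word from the stopWords set and that's it. The output contains all the other stop words even when I manage to remove the first one, and the first one then shows up again in the next chunk appended to the output.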

Part of my stop word list:

about
above
after
again
against
all
am
and
any
are
as
at

By token I mean word, i.e. taking every word from the line and comparing it with the current stop word.

1 Answer:

Answer 0: (score: 0)

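After debugging for a while, I believe I found a solution. This problem is tricky because you have to use several different scanners, file readers, and so on. Here is what I did: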

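I changed the way you add words to the stopWords set, because it wasn't adding them correctly. I use a buffered reader to read each line, then a scanner to read each word, and then add the word to the set.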
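A likely reason the set was not filled correctly: the update clause of the question's `for` loop calls `readerSW.readLine()` a second time, so every other line is read and thrown away. The sketch below reproduces this with an in-memory reader instead of a file. The class and method names (`ReadLineSkipDemo`, `readSkipping`, `readAll`) are made up for illustration and do not appear in the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

class ReadLineSkipDemo {

    // Same loop shape as in the question: the update clause consumes an extra line.
    static Set<String> readSkipping(String text) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        BufferedReader reader = new BufferedReader(new StringReader(text));
        for (String line; (line = reader.readLine()) != null; reader.readLine()) {
            words.add(line.trim());
        }
        return words;
    }

    // Plain while loop: every line is read exactly once.
    static Set<String> readAll(String text) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        BufferedReader reader = new BufferedReader(new StringReader(text));
        String line;
        while ((line = reader.readLine()) != null) {
            words.add(line.trim());
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        String text = "about\nabove\nafter\nagain\n";
        System.out.println(readSkipping(text)); // [about, after]
        System.out.println(readAll(text));      // [about, above, after, again]
    }
}
```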

Then, where you compare them, I got rid of one of your loops, because you can simply use the set's .contains() method to check whether a word is a stop word.
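Why the extra loop hurts: in the question's code the `lineReader` Scanner is used up while the first stop word is being checked, so the remaining stop words never see any tokens. Checking each token once against the whole set avoids that. A minimal sketch, using a hypothetical helper `removeStopWords(String, Set<String>)` that is not part of the original code:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Scanner;
import java.util.Set;
import java.util.StringJoiner;

class StopWordLineFilter {

    // Returns the line with every token that is in stopWords dropped.
    // Matching is exact (case-sensitive), like equals() in the question.
    static String removeStopWords(String line, Set<String> stopWords) {
        StringJoiner kept = new StringJoiner(" ");
        Scanner tokens = new Scanner(line);
        while (tokens.hasNext()) {
            String token = tokens.next();
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        tokens.close();
        return kept.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new LinkedHashSet<>(Arrays.asList("about", "all", "and", "are"));
        System.out.println(removeStopWords("cats and dogs are all about food", stopWords));
        // prints: cats dogs food
    }
}
```

Note that tokens such as "And" or "and," would not match the stop word "and" here; lowercasing and stripping punctuation before the lookup is one option if that matters for your files.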

I left the part that writes the file with the stop words taken out for you to do, since I'm sure you can work out the rest now.
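For the part left open, one possible way to write the filtered text is sketched below. It assumes UTF-8 files, whitespace-separated tokens, and exact matching. The class name `StopWordFileFilter`, the method `filterFile`, and the output file name are invented for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

class StopWordFileFilter {

    // Copies 'in' to 'out' line by line, leaving out tokens found in stopWords.
    static void filterFile(Path in, Path out, Set<String> stopWords) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                StringBuilder kept = new StringBuilder();
                for (String token : line.trim().split("\\s+")) {
                    if (token.isEmpty() || stopWords.contains(token)) {
                        continue; // skip stop words and the empty token of a blank line
                    }
                    if (kept.length() > 0) {
                        kept.append(' ');
                    }
                    kept.append(token);
                }
                writer.write(kept.toString());
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Input file names follow the sample files; the output name is an assumption.
        Set<String> stopWords = new LinkedHashSet<>();
        for (String line : Files.readAllLines(Paths.get("stopWords.txt"), StandardCharsets.UTF_8)) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    stopWords.add(word);
                }
            }
        }
        filterFile(Paths.get("Words.txt"), Paths.get("Words_NoStopWords.txt"), stopWords);
    }
}
```

Each input line is written exactly once, so the output cannot grow with the number of stop words the way it did in the question.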

- My sample stop words txt file: Stop words Words

- My sample input file is exactly the same, so it should catch all three words.

Code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.LinkedHashSet;
import java.util.Scanner;
import java.util.Set;

// The statements below go inside a method declared with "throws IOException".

// Read the stop words file line by line and save every word into the Set
Set<String> stopWords = new LinkedHashSet<String>();
try (BufferedReader readerSW = new BufferedReader(new FileReader("stopWords.txt"))) {
    String stopWordsLine;
    while ((stopWordsLine = readerSW.readLine()) != null) {
        // Scanner splits the line on whitespace, so no trim() is needed
        Scanner words = new Scanner(stopWordsLine);
        while (words.hasNext()) {          // hasNext() first, so blank lines are safe
            stopWords.add(words.next());   // add the stop word to the set
        }
        words.close();
    }
}

// Read the input file and report every token that is a stop word
try (BufferedReader input = new BufferedReader(new FileReader("Words.txt"))) {
    String line;
    while ((line = input.readLine()) != null) {
        Scanner lineReader = new Scanner(line);
        while (lineReader.hasNext()) {     // loop while the line has another word
            String token = lineReader.next();
            if (stopWords.contains(token)) {
                System.out.println("removing " + token);
            }
        }
        lineReader.close();
    }
}

Output:

removing Stop

removing words

removing Words

Let me know if I can elaborate on my code or on why I did something!