Question

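I'm building a program that reads a text file of stop words and then reads a text file of tweets collected from Twitter. I'm trying to remove the stop words from the tweets so that only the "interesting" words are left, and then print those to the console.

However, nothing is printed to the console, so clearly it isn't working. It did work before I switched to reading Test.txt (when I used a string created in the program, split it, and stored the pieces in an array).

I need help reading the Test.txt file, removing the stop words, and then printing the listOfWords list to the console.

Any help would be greatly appreciated.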

import java.util.*;
import java.io.*;

public class RemoveStopWords {

  public static void main(String[] args) {

    try {
    Scanner stopWordsFile = new Scanner(new File("stopwords_twitter.txt"));
    Scanner textFile = new Scanner(new File("Test.txt"));

    // Create a set for the stop words (a set as it doesn't allow duplicates)
    Set<String> stopWords = new HashSet<String>();
    // For each word in the file
    while (stopWordsFile.hasNext()) {
        stopWords.add(stopWordsFile.next().trim().toLowerCase());
    }

    // Splits strings and stores each word into a list
    ArrayList<String> words = new ArrayList<String>();
    while (stopWordsFile.hasNext()) {
        words.add(textFile.next().trim().toLowerCase());
    }

    // Create an empty list (a list because it allows duplicates) 
    ArrayList<String> listOfWords = new ArrayList<String>();

    // Iterate over the array 
    for(String word : words) {
        // Converts current string index to lowercase
        String toCompare = word.toLowerCase();
        // If the word isn't a stop word, add to listOfWords list
        if (!stopWords.contains(toCompare)) {
            listOfWords.add(word);
        }
    }

    stopWordsFile.close();
    textFile.close();

    for (String str : listOfWords) {
        System.out.print(str + " ");
    }
    } catch(FileNotFoundException e){
        e.printStackTrace();
    }
}
}

Answer 1

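You have two while (stopWordsFile.hasNext()) loops, and the condition of the second one will always be false, because the first loop has already consumed every token from stopWordsFile: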

// For each word in the file
while (stopWordsFile.hasNext()) {
    stopWords.add(stopWordsFile.next().trim().toLowerCase());
}

// Splits strings and stores each word into a list
ArrayList<String> words = new ArrayList<String>();
while (stopWordsFile.hasNext()) {
    words.add(textFile.next().trim().toLowerCase());
}

In the second loop, you should use

while (textFile.hasNext())

instead of

while (stopWordsFile.hasNext())
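Putting the fix together, here is a minimal sketch of the corrected read-and-filter logic. The class name StopWordFilter and the helper methods readWords and removeStopWords are made up for illustration, and the scanners read from in-memory strings instead of stopwords_twitter.txt and Test.txt so the example runs without any files; swapping in new Scanner(new File(...)) should behave the same way.

```java
import java.util.*;

public class StopWordFilter {

    // Reads every whitespace-separated token from the given scanner, lowercased.
    // Each scanner is drained by its own loop, which is the point of the fix.
    public static List<String> readWords(Scanner in) {
        List<String> words = new ArrayList<>();
        while (in.hasNext()) {
            words.add(in.next().toLowerCase());
        }
        return words;
    }

    // Returns the words from textIn that do not appear in stopWordsIn.
    public static List<String> removeStopWords(Scanner stopWordsIn, Scanner textIn) {
        Set<String> stopWords = new HashSet<>(readWords(stopWordsIn));
        List<String> kept = new ArrayList<>();
        for (String word : readWords(textIn)) {
            if (!stopWords.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // In-memory stand-ins for the two files (an assumption for this sketch).
        Scanner stopWordsIn = new Scanner("the a is");
        Scanner textIn = new Scanner("The weather is a delight");
        System.out.println(String.join(" ", removeStopWords(stopWordsIn, textIn)));
        // prints: weather delight
    }
}
```

Sharing one readWords helper between both scanners also makes it harder to mix up which scanner a loop is testing.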

Answer 2

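The problem is that you test stopWordsFile twice: the second loop checks the stop-word scanner, which is already exhausted, instead of textFile: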

while (stopWordsFile.hasNext()) { // this will never execute as stopWordsFile has no nextElement left
        words.add(textFile.next().trim().toLowerCase());
}

So change the second condition to:

while (textFile.hasNext()) { 
    words.add(textFile.next().trim().toLowerCase());
}

Answer 3

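Copy the file into another file by reading it line by line. If a line contains a 'stop word', remove that word from the line and then write the line to the new file; otherwise, write the line unchanged.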
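A rough sketch of that line-by-line approach might look like the following. The class name LineStopWordFilter and the methods filterLine and copyWithoutStopWords are hypothetical, and the example uses a StringReader and StringWriter in place of real files (a FileReader and FileWriter could be substituted). It assumes words are separated by whitespace and compares them case-insensitively.

```java
import java.io.*;
import java.util.*;

public class LineStopWordFilter {

    // Returns the line with every stop word removed; the remaining words
    // keep their original casing and are joined by single spaces.
    public static String filterLine(String line, Set<String> stopWords) {
        StringBuilder kept = new StringBuilder();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty() || stopWords.contains(token.toLowerCase())) {
                continue; // skip blank tokens and stop words
            }
            if (kept.length() > 0) {
                kept.append(' ');
            }
            kept.append(token);
        }
        return kept.toString();
    }

    // Copies the input to the output one line at a time, filtering each line.
    public static void copyWithoutStopWords(BufferedReader in, Writer out,
                                            Set<String> stopWords) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            out.write(filterLine(line, stopWords));
            out.write(System.lineSeparator());
        }
    }

    public static void main(String[] args) throws IOException {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "is"));
        // In-memory stand-ins for the source and destination files.
        BufferedReader in = new BufferedReader(new StringReader("The cat is here\nno change"));
        StringWriter out = new StringWriter();
        copyWithoutStopWords(in, out, stopWords);
        System.out.print(out);
    }
}
```

Unlike the list-based version in the question, this keeps the line structure of the original file, at the cost of writing a second file.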

Reading text files and removing words using a set and a list
