Ignoring certain words when reading a file

Time: 2015-05-09 10:17:28

Tags: java readfile

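My program reads a text file and lists the frequency of each word in it. Next, I need to ignore certain words while reading the file, such as 'the' and 'an'. I have created a list of these words, but I don't know how to apply it inside the while loop. Thanks.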

public static String [] ConnectingWords = {"and", "it", "you"};

public static void readWordFile(LinkedHashMap<String, Integer> wordcount) {
    // FileReader fileReader = null;
    Scanner wordFile;
    String word; // A word read from the file
    Integer count; // The number of occurrences of the word

    // LinkedHashMap <String, Integer> wordcount = new LinkedHashMap<String, Integer> ();

    try {
        wordFile = new Scanner(new FileReader("/Applications/text.txt"));
        wordFile.useDelimiter(" ");
    } catch (FileNotFoundException e) {
        System.err.println(e);
        return;
    }
    while (wordFile.hasNext()) {
        word = wordFile.next();
        word = word.toLowerCase();

        if (word.contains("the")) {
            count = getCount(word, wordcount) + 0;
            wordcount.put(word, count);

        }
        // Get the current count of this word, add one, and then store the
        // new count:
        count = getCount(word, wordcount) + 1;
        wordcount.put(word, count);
    }
}

3 answers:

Answer 0 (score: 2)

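Create a list containing the words that need to be ignored:

List<String> ignoreAll = Arrays.asList("and", "it", "you");

Then, inside the while loop, add a condition that skips any word contained in that list. Place the check after the word has been lowercased and before the count is updated: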

if (ignoreAll.contains(word)) {
    continue;
}
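To show where the `continue` check sits relative to the counting step, here is a minimal, self-contained sketch. The class name `IgnoreWordsSketch` and the helper `countWords` are hypothetical, and a sample string stands in for the file so the example runs without `/Applications/text.txt`. It also assumes whitespace-separated words (the Scanner's default delimiter) rather than the single-space delimiter used in the question.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class IgnoreWordsSketch {

    // Hypothetical helper: counts the words in `text`, skipping any word in `ignoreAll`.
    static Map<String, Integer> countWords(String text, List<String> ignoreAll) {
        Map<String, Integer> wordcount = new LinkedHashMap<>();
        try (Scanner words = new Scanner(text)) { // default delimiter: whitespace
            while (words.hasNext()) {
                String word = words.next().toLowerCase();
                if (ignoreAll.contains(word)) {
                    continue; // ignored word: skip the counting step below
                }
                Integer count = wordcount.get(word);
                wordcount.put(word, count == null ? 1 : count + 1);
            }
        }
        return wordcount;
    }

    public static void main(String[] args) {
        List<String> ignoreAll = Arrays.asList("and", "it", "you");
        // prints {i=1, like=2, tea=1}
        System.out.println(countWords("You and I like it and you like tea", ignoreAll));
    }
}
```

Because the word is lowercased before the check, "You" and "you" are both skipped; the entries in the ignore list need to be lowercase for that to work.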

Answer 1 (score: 2)

You can try the following code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class WordFrequency { // class name is a placeholder

    public static Set<String> connectingWords;
    public static Map<String, Integer> frequencyMap;

    static {
        connectingWords = new HashSet<>();
        connectingWords.add("and");
        connectingWords.add("it");
        connectingWords.add("you");
        frequencyMap = new HashMap<>();
    }

    public static void main(String[] args) {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader("src/files/temp2.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] words = line.split("-");
                for (String word : words) {
                    if (connectingWords.contains(word)) {
                        continue; // skip connecting words
                    }
                    Integer value = frequencyMap.get(word);
                    if (value != null) {
                        frequencyMap.put(word, value + 1);
                    } else {
                        frequencyMap.put(word, 1); // first occurrence counts as 1
                    }
                }
            }
        } catch (IOException e) { // also covers FileNotFoundException
            e.printStackTrace();
        }
        System.out.println(frequencyMap.values());
    }
}

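It is better to store the connecting words in a HashSet, because contains is called for every word in the file and a HashSet gives fast (constant-time on average) lookups. The words and their frequencies can be kept in a Map. I have also assumed that the words are separated by -; the code can be changed if the delimiter is something else. Likewise, you can change the code if you have any special requirements regarding case. I tried it on a file containing What-the-hell-is-going-on-and-it-is-good and it works fine.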
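The delimiter and case handling mentioned above can be made explicit. The following is a sketch under stated assumptions rather than part of the original answer: the class `StopWordSplitSketch`, the method `countLine`, and the `delimiterRegex` parameter are hypothetical names, and lowercasing every word is just one possible case policy.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StopWordSplitSketch {

    // Stored in a HashSet so each contains() call is a hash lookup.
    static final Set<String> CONNECTING_WORDS =
            new HashSet<>(Arrays.asList("and", "it", "you"));

    // Hypothetical helper: splits `line` on `delimiterRegex` (a regular
    // expression, as String.split expects) and counts the remaining words.
    static Map<String, Integer> countLine(String line, String delimiterRegex) {
        Map<String, Integer> frequencyMap = new HashMap<>();
        for (String raw : line.split(delimiterRegex)) {
            String word = raw.toLowerCase(); // assumed policy: ignore case
            if (word.isEmpty() || CONNECTING_WORDS.contains(word)) {
                continue; // skip empty tokens and connecting words
            }
            frequencyMap.merge(word, 1, Integer::sum); // insert 1 or add 1
        }
        return frequencyMap;
    }

    public static void main(String[] args) {
        // Same sample input as in the answer, split on "-".
        System.out.println(countLine("What-the-hell-is-going-on-and-it-is-good", "-"));
        // For space-separated text, a whitespace regex could be passed instead.
        System.out.println(countLine("You AND me", "\\s+"));
    }
}
```

Since a HashMap does not guarantee iteration order, the printed order of the entries may vary; a LinkedHashMap, as used in the question, would keep first-seen order.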

Answer 2 (score: 0)

Keep the excluded words in a list, and check that exclusion list before updating the count.

public static void readWordFile(LinkedHashMap<String, Integer> wordcount) {

    List<String> excludeList = new ArrayList<>();
    excludeList.add("the"); // and so on
    //  FileReader fileReader = null;
    Scanner wordFile;
    String word;     // A word read from the file
    Integer count;   // The number of occurrences of the word

    //  LinkedHashMap <String, Integer> wordcount = new LinkedHashMap <String, Integer> ();

    try
    {
        wordFile = new Scanner(new FileReader("/Applications/text.txt"));
        wordFile.useDelimiter(" ");
    }
    catch (FileNotFoundException e)
    {
        System.err.println(e);
        return;
    }
    while (wordFile.hasNext())
    {
        word = wordFile.next();
        word = word.toLowerCase();

        // Only count words that are not in the exclusion list
        if (!excludeList.contains(word))
        {
            // getOrDefault (Java 8+) returns 0 for a word not seen before,
            // which avoids a NullPointerException on the first occurrence
            count = wordcount.getOrDefault(word, 0) + 1;
            wordcount.put(word, count);
        }
    }
    wordFile.close();
}
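One detail to watch when checking the exclusion list before updating the count: `Map.get` returns null for a word that has not been seen yet, so adding 1 to that result throws a NullPointerException. The sketch below is one way around it using `Map.getOrDefault`, which is available from Java 8. The class `ExcludeListSketch` and the helper `countExcluding` are hypothetical names, and a sample string replaces the file so the example runs anywhere.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class ExcludeListSketch {

    // Hypothetical helper: reads words from `wordFile` and updates `wordcount`,
    // skipping every word that appears in `excludeList`.
    static void countExcluding(Scanner wordFile, List<String> excludeList,
                               Map<String, Integer> wordcount) {
        while (wordFile.hasNext()) {
            String word = wordFile.next().toLowerCase();
            if (!excludeList.contains(word)) {
                // getOrDefault returns 0 for a word not seen before, so the
                // first occurrence is stored as 1 instead of failing on null.
                wordcount.put(word, wordcount.getOrDefault(word, 0) + 1);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> wordcount = new LinkedHashMap<>();
        // Sample text stands in for the file; the Scanner splits on whitespace.
        try (Scanner wordFile = new Scanner("The cat saw the dog and the cat ran")) {
            countExcluding(wordFile, Arrays.asList("the", "and"), wordcount);
        }
        System.out.println(wordcount); // prints {cat=2, saw=1, dog=1, ran=1}
    }
}
```

To read from a real file, the same helper could be given a Scanner built from a FileReader, as in the question.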