How to calculate word frequency from a txt file - Java

Time: 2016-11-19 16:38:57

Tags: java frequency word-count

I need some help with this code. I want my program to count the frequency of each word that matches the given pattern.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Project {
    public static void main(String[] args) throws FileNotFoundException{
    Scanner INPUT_TEXT = new Scanner(new File("moviereview.txt")).useDelimiter(" ");

    String pattern = "[a-zA-Z'-]+";
    Pattern r = Pattern.compile(pattern);

    int occurences=0;

    while(INPUT_TEXT.hasNext()){
        //read next word
        String Stringcandidate=INPUT_TEXT.next();

        //see if pattern matches (boolean find)
        Matcher m = r.matcher(Stringcandidate);
        if(m.find()) {
            occurences++; //increment occurences if pattern is found
            String moviereview = m.group(0); //retrieve found string
            String moviereview2 = moviereview.toLowerCase(); //lower-case the word

            System.out.print(moviereview2 + " appears " + occurences);
            if(occurences>1){
                 System.out.println(" times\n");
            }
            else{
                System.out.println(" time\n");
            }
        }
    }
    INPUT_TEXT.close();//Close your Scanner once, after the loop.
    }

}

1 Answer:

Answer 0 (score: 1)

As stated in my earlier comment, you can use a Map (HashMap) to store the matched words together with their occurrences/frequencies.
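A minimal sketch of that idea, assuming Java 8 or later. The class name WordFrequencySketch and the method countWords are hypothetical and only serve this illustration; the regex is the one from the question. It uses Map.merge, which stores 1 for a word seen for the first time and otherwise adds 1 to the stored count.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordFrequencySketch {

    // Same character class as in the question: letters, apostrophes and hyphens.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z'-]+");

    /** Counts how often each lower-cased word occurs in the given text. */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher matcher = WORD.matcher(text);
        // find() advances through the text, one match per iteration.
        while (matcher.find()) {
            String word = matcher.group().toLowerCase(Locale.ROOT);
            // merge() stores 1 for a new word, otherwise adds 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countWords("Auto bush trumped her tomato in the petunia auto");
        System.out.println(counts.get("auto")); // prints 2
        System.out.println(counts.get("bush")); // prints 1
    }
}
```

Lower-casing before counting means "Auto" and "auto" are treated as the same word, which is what the toLowerCase() call in the question appears to aim for.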

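I suggest encapsulating the program's functionality in smaller methods/classes, so that each method/class performs only one small task. That makes the code easier to read.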
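As one illustration of such a single-purpose method, here is a sketch that does nothing but read tokens. The names FileTokensSketch and readTokens are hypothetical, and a temporary file stands in for moviereview.txt so the example runs on its own. It keeps Scanner's default whitespace delimiter, so words separated by line breaks are split as well, which useDelimiter(" ") would not do.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class FileTokensSketch {

    /** Reads all whitespace-separated tokens from the given source. */
    static List<String> readTokens(Readable source) {
        List<String> tokens = new ArrayList<>();
        // try-with-resources closes the Scanner even if reading fails.
        try (Scanner scanner = new Scanner(source)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temporary file replaces moviereview.txt for this demo.
        Path file = Files.createTempFile("moviereview", ".txt");
        Files.write(file, "auto bush\ntrumped her tomato".getBytes(StandardCharsets.UTF_8));
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            System.out.println(readTokens(reader)); // prints [auto, bush, trumped, her, tomato]
        }
        Files.delete(file);
    }
}
```

Because the method accepts any Readable, it can be tried with a StringReader before pointing it at a real file.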

I assume your file contains the string "auto bush trumped her tomato in the petunia auto".

Here is the code:

package how_to_calculate_the_frequency;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Project {

    HashMap<String, Integer> map = new HashMap<String, Integer>();

    public static void main(String[] args){

        Project project = new Project();

        Scanner INPUT_TEXT = project.readFile();

        project.analyse(INPUT_TEXT);

        project.showResults();

    }

    /**
     * logic to count the occurrences of words matched by REGEX in a scanner that
     * loaded some text
     * 
     * @param scanner
     *            the scanner holding the text
     */
    public void analyse(Scanner scanner) {

        String pattern = "[a-zA-Z'-]+";
        Pattern r = Pattern.compile(pattern);

        while (scanner.hasNext()) {
            // read next word
            String Stringcandidate = scanner.next();

            // see if pattern matches (boolean find)
            Matcher matcher = r.matcher(Stringcandidate);
            if (matcher.find()) {
                String matchedWord = matcher.group();
                //System.out.println(matchedWord); //check what is matched
                this.addWord(matchedWord);

            }

        }
        scanner.close();// Close your Scanner.
    }

    /**
     * adds a word to the <word,count> Map if the word is new, a new entry is
     * created, otherwise the count of this word is incremented
     */
    public void addWord(String matchedWord) {

        if (map.containsKey(matchedWord)) {
            // increment occurrence
            int occurrence = map.get(matchedWord);
            occurrence++;
            map.put(matchedWord, occurrence);
        } else {
            // add word and set occurrence to 1
            map.put(matchedWord, 1);
        }

    }

    /**
     * reads a file from disk and returns a scanner to analyse it
     * 
     * @return the file from disk as scanner
     */
    public Scanner readFile() {

        Scanner scanner = null;

        /* use this for reading a file from disk:
         * try {
         *     scanner = new Scanner(new File("moviereview.txt")).useDelimiter(" ");
         * } catch (Exception e) {
         *     e.printStackTrace();
         * }
         */

        scanner = new Scanner("auto bush trumped her tomato in the petunia auto");

        return scanner;
    }

    /**
     * prints the matched words and their occurrences
     * in a readable way
     */
    public void showResults() {

        for (HashMap.Entry<String, Integer> matchedWord : map.entrySet()) {
            int occurrence = matchedWord.getValue();
            System.out.print("\"" + matchedWord.getKey() + "\" appears " + occurrence);
            if (occurrence > 1) {
                System.out.print(" times\n");
            } else {
                System.out.print(" time\n");
            }
        }

        // or, using a Java 8 lambda expression:
        // map.forEach((word, occurrence) ->
        //         System.out.println("\"" + word + "\" appears " + occurrence + " times"));
    }
}

// DONE: separate reading the file, analysing the file and the
// word-frequency counting logic into different methods
// DONE: implement the <word,count> Map and the logic to add new and
// already known (to the map) words

This produces:

"the" appears 1 time
"auto" appears 2 times
"her" appears 1 time
"in" appears 1 time
"bush" appears 1 time
"trumped" appears 1 time
"tomato" appears 1 time
"petunia" appears 1 time
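
The order of these lines is not guaranteed, because HashMap makes no promise about iteration order. If an alphabetical listing is preferred, one option is to collect the counts in a TreeMap instead. The sketch below assumes Java 8 or later; the class name SortedResultsSketch and the helper formatLine are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class SortedResultsSketch {

    /** Formats one result line, choosing "time" or "times" based on the count. */
    static String formatLine(String word, int count) {
        return "\"" + word + "\" appears " + count + (count > 1 ? " times" : " time");
    }

    public static void main(String[] args) {
        // TreeMap keeps its keys sorted in natural (alphabetical) order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : "auto bush trumped her tomato in the petunia auto".split(" ")) {
            counts.merge(word, 1, Integer::sum);
        }
        // Iteration now follows the key order, so "auto" is printed first.
        counts.forEach((word, count) -> System.out.println(formatLine(word, count)));
    }
}
```

The trade-off is that TreeMap operations take logarithmic time, while HashMap operations take constant time on average; for a single review file the difference is unlikely to matter.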

Regards