How do I eliminate the spaces in a text file during text analysis?

Time: 2015-03-23 20:14:58

Tags: java frequency analysis

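I'm trying to get my program to display the frequency of the letters in a text file. At the moment it displays the frequencies for each word in the file separately. For example, if the text in the file is "i am a man", it outputs the letter frequencies four times, once for each of the words "i", "am", "a", "man". I need it to analyse the whole text as a single word, so the spaces should be removed and the text treated as "iamaman".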

6 answers:

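Answer 0: (score: 1)

Having spaces in the text is not a problem. In fact, you are already ignoring spaces when you check Character.isLetter() before adding to the count.

The main thing is that you need to move the for loops that do the final tally outside the main while loop that iterates over the tokens.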

import java.util.*;
import java.io.*;

public class J_ {

    public static void main(String[] args)throws Exception {
        // open the file
        Scanner console = new Scanner(System.in);
        System.out.print("What is the name of the text file? ");
        String fileName = console.nextLine();
        Scanner input = new Scanner(new File(fileName));

        //initialize array with 26 elements
        int[] letterArray = new int[26]; 

        while (input.hasNext()) {
            String next = input.next().toLowerCase();

            //run loop for each token, incrementing per character
            for (int i = 0; i < next.length(); i++) {
                char characters = next.charAt(i);

                //ignore all characters which aren't alphabetic 
                if (Character.isLetter(characters)) {

                    //if character is uppercase then convert to lowercase
                    characters = Character.toLowerCase(characters);

                    //populate array 
                    int index = characters - 'a';
                    letterArray[index]++;
                }
            }
        }

        int total = 0;
        for(int i = 0; i < letterArray.length; i ++) {
            total += letterArray[i];
        }

        for (char characters = 'a'; characters <= 'z'; characters++) {
            int index = characters - 'a';
            //print out the analysis
            System.out.println("'" + characters + "' entered " + (((double)letterArray[index] / (double)total) * 100) 
                               + " percent");
        }
    }
}


$ cat abc.txt
a b c

$ java J_
What is the name of the text file? abc.txt
'a' entered 33.33333333333333 percent
'b' entered 33.33333333333333 percent
'c' entered 33.33333333333333 percent
'd' entered 0.0 percent
'e' entered 0.0 percent
'f' entered 0.0 percent
'g' entered 0.0 percent
'h' entered 0.0 percent
'i' entered 0.0 percent
'j' entered 0.0 percent
'k' entered 0.0 percent
'l' entered 0.0 percent
'm' entered 0.0 percent
'n' entered 0.0 percent
'o' entered 0.0 percent
'p' entered 0.0 percent
'q' entered 0.0 percent
'r' entered 0.0 percent
's' entered 0.0 percent
't' entered 0.0 percent
'u' entered 0.0 percent
'v' entered 0.0 percent
'w' entered 0.0 percent
'x' entered 0.0 percent
'y' entered 0.0 percent
'z' entered 0.0 percent

Answer 1: (score: 1)

If I understand correctly, all you have to do is move the last for loop outside the while loop, like so:

import java.io.File;
import java.util.Scanner;

public class JCountlettersfilereader {
  public static void main(String[] args) throws Exception {
    // open the file
    // Scanner console = new Scanner(System.in);
    // System.out.print("What is the name of the text file? ");
    String fileName = "file.txt";
    Scanner input = new Scanner(new File(fileName));

    // initialize array with 26 elements
    int[] letterArray = new int[26];
    int totalLetters = 0;

    while (input.hasNext()) {
        String next = input.next().toLowerCase();

        // run loop for each line incrementing per character
        for (int i = 0; i < next.length(); i++) {
            char characters = next.charAt(i);

            // ignore all characters which aren't alphabetic
            if (Character.isLetter(characters)) {
                totalLetters++;
                // if character is uppercase then convert to lowercase
                characters = Character.toLowerCase(characters);

                // populate array
                int index = characters - 'a';
                letterArray[index]++;
            }
        }
    }

    for (char characters = 'a'; characters <= 'z'; characters++) {
        int index = characters - 'a';
        // print out the analysis
        System.out.println("'" + characters + "' entered "
                + (((double) letterArray[index] / (double) totalLetters) * 100)
                + " percent" + "(" + letterArray[index] + " /" + totalLetters + ")");
    }
  }
}

It returns:

'a' entered 42.857142857142854 percent(3 /7)
...
'i' entered 14.285714285714285 percent(1 /7)
...
'm' entered 28.57142857142857 percent(2 /7)
'n' entered 14.285714285714285 percent(1 /7)

Is this what you expected?

Answer 2: (score: 0)

One way to remove the spaces is:

"i am a man".replaceAll(" ", "");
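
To show how this fits the question, here is a minimal sketch that strips the spaces and then counts the letters of the joined string. The class and method names (SpaceStripDemo, countLetters) are made up for illustration, and the sketch assumes only the letters a to z matter.

```java
import java.util.Locale;

class SpaceStripDemo {
    // Hypothetical helper: removes plain spaces, then counts 'a'..'z'.
    static int[] countLetters(String text) {
        // "i am a man" becomes "iamaman"
        String joined = text.replaceAll(" ", "").toLowerCase(Locale.ROOT);
        int[] counts = new int[26];
        for (char c : joined.toCharArray()) {
            // Skip anything outside a..z so the array index stays in range
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countLetters("i am a man");
        System.out.println("a=" + counts[0] + ", m=" + counts['m' - 'a']);
    }
}
```

Note that replaceAll(" ", "") only removes the space character itself; tabs and line breaks are left in place.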

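Answer 3: (score: 0)

Move the code that prints the results outside the while loop. You only need to run it once, not once for every word in the file.

Also, you don't need to convert to lowercase on two different lines.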
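
A rough sketch of both points, reading from a String instead of a file so it is self-contained. The names PrintOnceDemo and report are hypothetical, and the a-to-z range check is an assumption of this sketch rather than part of the original program.

```java
import java.util.Locale;
import java.util.Scanner;

class PrintOnceDemo {
    // Hypothetical helper: tallies letters word by word, then builds the report once.
    static String report(String text) {
        int[] counts = new int[26];
        int total = 0;
        try (Scanner input = new Scanner(text)) {
            while (input.hasNext()) {
                // Lowercase once per token; no second conversion is needed later
                String word = input.next().toLowerCase(Locale.ROOT);
                for (int i = 0; i < word.length(); i++) {
                    char c = word.charAt(i);
                    if (c >= 'a' && c <= 'z') {
                        counts[c - 'a']++;
                        total++;
                    }
                }
            }
        }
        // The report is built after the while loop, so it appears only once
        StringBuilder sb = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            int n = counts[c - 'a'];
            if (n > 0) {
                sb.append(c).append('=').append(n).append('/').append(total).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report("I am a man"));
    }
}
```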

Answer 4: (score: 0)

Use replaceAll("\\s", "");

This removes all whitespace (line breaks, tabs, spaces).
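
As a small illustration of the difference, the sketch below compares removing only the space character with removing every whitespace character via the regex class \s (written "\\s" in a Java string literal). The class and method names are invented for this example.

```java
class WhitespaceDemo {
    // Removes only the plain space character
    static String stripSpacesOnly(String s) {
        return s.replaceAll(" ", "");
    }

    // Removes spaces, tabs, line breaks and other whitespace matched by \s
    static String stripAllWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String text = "i am\ta\nman";
        System.out.println(stripSpacesOnly(text).length());   // tab and newline survive
        System.out.println(stripAllWhitespace(text));
    }
}
```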

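Answer 5: (score: 0)

You can set the delimiter to \\W+ (one or more non-word characters), which means the tokens will never contain spaces

Call

input.useDelimiter("\\W+");

outside the while loop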
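
For reference, the Scanner method that changes the delimiter is useDelimiter. The following is only a sketch, assuming the pattern "\\W+" (runs of non-word characters) is the intended delimiter; DelimiterDemo and tokens are hypothetical names. Keep in mind that a Scanner already splits on whitespace by default, so this mainly helps when punctuation should be dropped as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Hypothetical helper: splits the text on runs of non-word characters.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner input = new Scanner(text)) {
            // Set the delimiter once, before the loop starts
            input.useDelimiter("\\W+");
            while (input.hasNext()) {
                out.add(input.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("i am, a man!"));
    }
}
```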