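Display the 10 most frequently occurring words in a file, in descending order

Time: 2011-08-16 08:49:40

Tags: java

I'm trying to make my code cleaner and more efficient. I tried to implement zamzela's approach [you'll find it in one of the answers below], but I can't get the comparator to work.

public class WordCountExample {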

public static void main(String[] args) throws IOException {

    Set<WordCount> wordcount = new HashSet<WordCount>();

    File file = new File("c:\\test\\input1.txt");    //path to the file

    String str = FileUtils.readFileToString(file);   // converts a file into a string


    String[] words = str.split("\\s+");     // split the line on whitespace,
                                            // would return an array of words

    for (String s : words) {

        wordcount.add(new WordCount(s));

        WordCount.incCount();

    }

    /* here WordCount is the name of the comparator class */

    Collections.sort(wordcount, new WordCount());   // getting an error here


    for (WordCount w : wordcount) {

        System.out.println(w.getValue() + " " + w.getCount());
    }

}

}

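5 answers:

Answer 0 (score: 3)

Don't store the word count as the value in the map. Store an object that contains the word together with its number of occurrences.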

public class WordWithOccurrences {
    private final String word;
    private int occurrences;
    // ...
}

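So your map should be a Map<String, WordWithOccurrences>.

Then sort the list of values by their occurrences property, and iterate over the last 10 values to display their word property (or sort in reverse order and display the first ten).

You must use a custom comparator to sort the WordWithOccurrences instances.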
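
A minimal sketch of how this could look, not the answerer's actual code. The fields word and occurrences come from the answer; the enclosing class TopWordsSketch, the helper methods count and topTen, the getters, and increment are names assumed here for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopWordsSketch {
    // Holds a word together with how often it has been seen.
    static class WordWithOccurrences {
        private final String word;
        private int occurrences;

        WordWithOccurrences(String word) { this.word = word; }
        public String getWord() { return word; }
        public int getOccurrences() { return occurrences; }
        public void increment() { occurrences++; }
    }

    // Builds the Map<String, WordWithOccurrences> described in the answer.
    static Map<String, WordWithOccurrences> count(String[] words) {
        Map<String, WordWithOccurrences> map = new HashMap<String, WordWithOccurrences>();
        for (String w : words) {
            if (w.isEmpty()) continue;            // split can yield an empty first element
            WordWithOccurrences entry = map.get(w);
            if (entry == null) {
                entry = new WordWithOccurrences(w);
                map.put(w, entry);
            }
            entry.increment();
        }
        return map;
    }

    // Sorts the map's values by occurrences, highest first, and keeps at most ten.
    static List<WordWithOccurrences> topTen(Map<String, WordWithOccurrences> map) {
        List<WordWithOccurrences> list = new ArrayList<WordWithOccurrences>(map.values());
        Collections.sort(list, new Comparator<WordWithOccurrences>() {
            @Override
            public int compare(WordWithOccurrences a, WordWithOccurrences b) {
                return Integer.compare(b.getOccurrences(), a.getOccurrences());
            }
        });
        return list.subList(0, Math.min(10, list.size()));
    }

    public static void main(String[] args) {
        String[] words = "a b a c b a".split("\\s+");
        for (WordWithOccurrences w : topTen(count(words))) {
            System.out.println(w.getWord() + " " + w.getOccurrences());
        }
    }
}
```

The comparator swaps its arguments so that the list comes out in descending order, which lets the caller take the first ten elements directly.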

Answer 1 (score: 2)

I think the best approach is to create a Word class:

public class Word implements Comparable<Word> {
    private String value;
    private Integer count;

    public Word(String value) {
        this.value = value;
        count = 1;
    }

    public String getValue() {
        return value;
    }

    public Integer getCount() {
        return count;
    }

    public void incCount() {
        count++;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof Word)
            return value.equals(((Word) obj).getValue());
        else
            return false;
    }

    @Override
    public int hashCode() {
        return value.hashCode();
    }

    @Override
    public int compareTo(Word o) {
        return count.compareTo(o.getCount());
    }
}

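You can use a HashSet, because the count is kept in the bean itself. After you have filled it, copy the elements into a list, sort it with Collections.sort(list), and take 10 elements. Note that compareTo as written orders by ascending count, so either take the last 10 or sort with Collections.reverseOrder().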
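
A rough sketch of the sorting step, using a trimmed copy of the Word bean above. One assumption differs from the answer: the words are kept in a HashMap keyed by their text rather than in a HashSet, so that an existing Word can be looked up and incremented. The names WordSortSketch and topTen are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordSortSketch {
    // Trimmed copy of the Word bean: its natural order is ascending by count.
    static class Word implements Comparable<Word> {
        private final String value;
        private Integer count = 1;

        Word(String value) { this.value = value; }
        public String getValue() { return value; }
        public Integer getCount() { return count; }
        public void incCount() { count++; }

        @Override
        public int compareTo(Word o) { return count.compareTo(o.getCount()); }
    }

    // Counts the words, then returns at most ten of them, most frequent first.
    static List<Word> topTen(String[] words) {
        // Assumption: a map keyed by the word text instead of a HashSet.
        Map<String, Word> byValue = new HashMap<String, Word>();
        for (String s : words) {
            if (s.isEmpty()) continue;
            Word w = byValue.get(s);
            if (w == null) {
                byValue.put(s, new Word(s));   // first sighting starts at count 1
            } else {
                w.incCount();
            }
        }
        List<Word> sorted = new ArrayList<Word>(byValue.values());
        // reverseOrder() flips compareTo, giving descending counts.
        Collections.sort(sorted, Collections.reverseOrder());
        return new ArrayList<Word>(sorted.subList(0, Math.min(10, sorted.size())));
    }

    public static void main(String[] args) {
        for (Word w : topTen("to be or not to be to".split("\\s+"))) {
            System.out.println(w.getValue() + " " + w.getCount());
        }
    }
}
```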

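Answer 2 (score: 1)

Finally solved it. Here is a working program that reads a file, counts the words, and lists the 10 most frequently occurring words in descending order.

import java.io.*;
import java.util.*;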

public class Occurance {

public static void main(String[] args) throws IOException {         
    LinkedHashMap<String, Integer> wordcount =
            new LinkedHashMap<String, Integer>();
    try { 
        BufferedReader in = new BufferedReader(
                                  new FileReader("c:\\test\\input1.txt"));
        String str;

        while ((str = in.readLine()) != null) { 
            str = str.toLowerCase(); // convert to lower case 
            String[] words = str.split("\\s+"); //split the line on whitespace, would return an array of words

            for( String word : words ) {
              if( word.length() == 0 ) {
                continue; 
              }

              Integer occurences = wordcount.get(word);

              if( occurences == null) {
                occurences = 1;
              } else {
                occurences++;
              }

              wordcount.put(word, occurences);
            }

        }
    } catch (Exception e) {
        System.out.println(e);
    }




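    // Sort the counts from highest to lowest.
    ArrayList<Integer> values = new ArrayList<Integer>();
    values.addAll(wordcount.values());
    Collections.sort(values, Collections.reverseOrder());

    int last_i = -1;

    // Look at the 10 highest counts (fewer if the file has fewer distinct words).
    for (Integer i : values.subList(0, Math.min(10, values.size()))) {
        if (last_i == i) // skip a count that was already handled
            continue;
        last_i = i;

        // Print every word with this count; ties can push the output past 10 lines.
        for (String s : wordcount.keySet()) {
            if (wordcount.get(s).equals(i)) // equals(), not ==: these are Integer objects
                System.out.println(s + " " + i);
        }
    }

}

}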

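Answer 3 (score: 0)

Assuming your program doesn't actually work, here's a hint:

You are doing the comparison yourself, character by character. Without having gone through that code in detail, I'd bet it is wrong: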

int idx1 = -1;

for (int i = 0; i < str.length(); i++) { 
  if ((!Character.isLetter(str.charAt(i))) || (i + 1 == str.length())) { 
    if (i - idx1 > 1) { 
       if (Character.isLetter(str.charAt(i))) 
         i++;
       String word = str.substring(idx1 + 1, i);
       if (wordcount.containsKey(word)) { 
          wordcount.put(word, wordcount.get(word) + 1);
       } else { 
          wordcount.put(word, 1);
       } 
     }          
     idx1 = i;
   } 
 } 

Try using Java's built-in functionality instead:

  String[] words = str.split("\\s+"); //split the line on whitespace, would return an array of words

  for( String word : words ) {
    if( word.length() == 0 ) {
      continue; //for empty lines, split would return at least one element which is ""; so account for that
    }

    Integer occurences = wordcount.get(word);

    if( occurences == null) {
      occurences = 1;
    } else {
      occurences++;
    }

    wordcount.put(word, occurences);
  }

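Answer 4 (score: 0)

I would take a look at java.util.Comparator. You can define your own comparator and pass it to Collections.sort(). In your case, you would sort the wordcount entries by their count. Finally, just take the first ten items of the sorted collection.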
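
As an illustration, assuming wordcount is a Map<String, Integer> as in the other answers, the entries can be copied into a list and sorted with a comparator. The names EntrySortSketch, BY_COUNT_DESC, and topTen are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EntrySortSketch {
    // Orders map entries by their count, highest first.
    static final Comparator<Map.Entry<String, Integer>> BY_COUNT_DESC =
            new Comparator<Map.Entry<String, Integer>>() {
                @Override
                public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                    return b.getValue().compareTo(a.getValue());
                }
            };

    // Copies the entries into a list, sorts them, and keeps at most ten.
    static List<Map.Entry<String, Integer>> topTen(Map<String, Integer> wordcount) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(wordcount.entrySet());
        Collections.sort(entries, BY_COUNT_DESC);
        return entries.subList(0, Math.min(10, entries.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> wordcount = new HashMap<String, Integer>();
        wordcount.put("the", 5);
        wordcount.put("cat", 2);
        wordcount.put("sat", 1);
        for (Map.Entry<String, Integer> e : topTen(wordcount)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

A map itself cannot be passed to Collections.sort(), which only accepts a List, so the copy into an ArrayList is needed first.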

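If your wordcount map has too many items, you may need something more efficient. This can be done in linear time by keeping a sorted array of size 10 into which you insert each key, always discarding the key with the lowest count.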
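
One way to sketch this idea, swapping the sorted array for a java.util.PriorityQueue used as a bounded min-heap (a substitution made here, not something the answer specifies). It again assumes wordcount is a Map<String, Integer>; BoundedTopSketch and top are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class BoundedTopSketch {
    // Returns the k entries with the highest counts, highest first.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> wordcount, int k) {
        Comparator<Map.Entry<String, Integer>> byCountAsc =
                new Comparator<Map.Entry<String, Integer>>() {
                    @Override
                    public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                        return a.getValue().compareTo(b.getValue());
                    }
                };
        // Min-heap: the head is always the lowest count kept so far.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<Map.Entry<String, Integer>>(k + 1, byCountAsc);
        for (Map.Entry<String, Integer> e : wordcount.entrySet()) {
            heap.add(e);
            if (heap.size() > k) {
                heap.poll();   // discard the entry with the lowest count
            }
        }
        // Only the k survivors are sorted, not the whole map.
        List<Map.Entry<String, Integer>> result =
                new ArrayList<Map.Entry<String, Integer>>(heap);
        Collections.sort(result, Collections.reverseOrder(byCountAsc));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordcount = new HashMap<String, Integer>();
        for (int i = 1; i <= 12; i++) {
            wordcount.put("w" + i, i);   // made-up sample data
        }
        for (Map.Entry<String, Integer> e : top(wordcount, 10)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Because the heap never holds more than k + 1 entries, each insertion costs a bounded amount of work when k is fixed at 10, so the pass over the map stays linear in the number of distinct words.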