Question

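I want to count how many times each word occurs in a given file, but I am having trouble with it. I have tried two different approaches. First I used a HashMap with the word as the key and its frequency as the associated value. That did not seem to work, because with a HashMap you cannot access the element at a specified index. Now I am trying two separate ArrayLists, one for the words and one for the number of occurrences of each word. My idea is: when adding a word to the wordsCount ArrayList, if the word is already in wordsCount, increment the element of the cnt ArrayList at the index of the word that has already been seen. However, I am not sure what to write to increment the value.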

import java.io.*;
import java.lang.reflect.Array;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

public class MP0 {
    Random generator;
    String delimiters = " \t,;.?!-:@[](){}_*/";
    String[] stopWordsArray = {"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
            "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its",
            "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that",
            "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
            "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
            "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
            "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
            "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
            "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
            "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"};
    private static String str;
    private static File file;
    private static Scanner s;   

    public MP0() {
    }

    public void process() throws Exception{
        ArrayList<Integer> cnt = new ArrayList<Integer>();
        boolean isStopWord = false;
        StringTokenizer st = new StringTokenizer(s.nextLine(), delimiters);
        ArrayList<String> wordsCount = new ArrayList<String>();

        while(st.hasMoreTokens()) {
            String s = st.nextToken().toLowerCase();
            if(!wordsCount.contains(s)) {
                for(int i = 0; i < stopWordsArray.length; i++) {
                    isStopWord = false;
                    if(s.equals(stopWordsArray[i])) {
                        isStopWord = true;
                        break;
                    }
                }
                if(isStopWord == false) {
                    wordsCount.add(s);
                    cnt.add(1);
                }
            }
            else { // i tried this but only displayed "1" for all words
                cnt.set(wordsCount.indexOf(s), cnt.get(wordsCount.indexOf(s) + 1));
            }
        }


        for(int i = 0; i < wordsCount.size(); i++) {
            System.out.println(wordsCount.get(i) + " " + cnt.get(i));
        }

    }

    public static void main(String args[]) throws Exception {
            try {
                file = new File("input.txt");
                s = new Scanner(file);
                str = s.nextLine();
                String[] topItems;
                MP0 mp = new MP0();
                while(s.hasNext()) {
                    mp.process();
                    str = s.nextLine();
                }
            }
            catch(FileNotFoundException e) {
                System.out.println("File not found");
            }
    }

}

Answer 1

I believe you can use a HashMap to do what you want. Something like this:

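HashMap<String, Integer> mymap = new HashMap<>();

// Note: loop over the words you want to count; stopWordsArray is simply the array name used here.
for (String word : stopWordsArray) {
    if (mymap.containsKey(word)) {
        mymap.put(word, mymap.get(word) + 1); // seen before: increment the stored count
    } else {
        mymap.put(word, 1); // first occurrence (autoboxed; new Integer(1) is deprecated)
    }
}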

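Edit: added the correction from the comments.

Second edit: Here is the Oracle tutorial on how to do this:

It is the same idea, but it looks more concise. Here is the relevant snippet: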

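Map<String, Integer> m = new HashMap<>(); // word -> frequency
for (String word : stopWordsArray) {
    Integer freq = m.get(word); // null if the word has not been seen yet
    m.put(word, (freq == null) ? 1 : freq + 1);
}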
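
To make the map approach concrete, here is a minimal, self-contained sketch. The class name WordCountSketch, the method countWords, and the sample words are made up for illustration and are not part of the question's code. It uses Map.merge (available since Java 8) in place of the explicit null check, and it iterates over the entries, which is how you read a map without needing an index.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCountSketch {
    // Counts how often each word occurs; LinkedHashMap keeps first-seen order.
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : words) {
            // merge stores 1 for a new key, otherwise adds 1 to the existing value
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, standing in for the tokens read from the file
        String[] words = {"goal", "power", "goal", "plan", "goal"};
        Map<String, Integer> counts = countWords(words);
        // No index is needed: walk the entries instead
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

If you would rather keep the two-list approach from the question, note that the + 1 there sits inside the call to get, so it reads the count stored at the next index instead of incrementing. Computing int idx = wordsCount.indexOf(s) once and calling cnt.set(idx, cnt.get(idx) + 1) increments the stored count.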

Answer 2

You can also use Pattern and Matcher.

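import java.util.regex.Matcher;
import java.util.regex.Pattern;

String in = "our goal is our power";
int i = 0;
Pattern p = Pattern.compile("our"); // matches the substring "our" anywhere in the text
Matcher m = p.matcher(in);
while (m.find()) { // each successful find() is one more occurrence
    i++;
}
// i is now 2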
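
One caveat with a bare pattern like "our" is that it also matches inside longer words such as "your" or "hour". The sketch below shows one possible way to count whole words only, using the \b word-boundary anchor. The class name WholeWordCount, the method countWord, and the sample sentence are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeWordCount {
    // Counts whole-word, case-insensitive occurrences of word in text.
    static int countWord(String text, String word) {
        // Pattern.quote treats the word literally; \b marks a word boundary
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                                    Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample text: "your" and "hour" contain "our" but are not counted
        String in = "Our goal is our power, not your hour";
        System.out.println(countWord(in, "our")); // prints 2
    }
}
```

This counts one word at a time, so for a full frequency table over a file a map is still the more natural fit.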

Answer 3

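I think a Map is definitely the way to represent the count of each word. In my opinion, the best way to get the map (or at least one that has not been mentioned yet) is to put the words into a Stream. That way you can take advantage of a great deal of code already written in the Java standard library, which makes your code more concise and saves you from reinventing every wheel. Streams have a bit of a learning curve, but once you understand them they are very useful. For example, watch your 20+ line method reduce to 2 lines: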

import java.util.Map;
import java.util.ArrayList;
import java.util.Arrays;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.summingInt;
import static java.util.function.Function.identity;

public class CountWords
{
    private static String delimiters = "[ \t,;.?!\\-:@\\[\\](){}_*/]+";
    private static ArrayList<String> stopWords =    new ArrayList<>(Arrays.asList(new String[] {"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
                                                "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its",
                                                "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that",
                                                "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
                                                "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
                                                "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
                                                "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
                                                "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
                                                "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
                                                "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"}));
    public static void main(String[] args) throws IOException //Your code should likely catch this
    {
        Path fLoc = Paths.get("test.txt"); //Or get from stdio, args[0], etc...
        CountWords cw = new CountWords();
        Map<String, Integer> counts = cw.count(Files.lines(fLoc).flatMap(s -> Arrays.stream(s.split(delimiters))));
        counts.forEach((k, v) -> System.out.format("Key: %s, Val: %d\n", k, v));
    }

    public Map<String, Integer> count(Stream<String> words)
    {
        return words.filter(s -> !stopWords.contains(s))
                    .collect(groupingBy(identity(), summingInt(s -> 1)));
    }
}

Most of this is easy to look up in the API, but here are the parts that may be less obvious:

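Files.lines: A nifty little method that takes the path to a file and returns a Stream of all the lines in that file. What we actually want is a stream of words, though, which brings us to the next operation.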
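.flatMap: In general, a map operation takes each item in a collection and turns it into something else. Stream has a method called map that does exactly this. In our case, however, we want to turn lines into words, and each line may contain many words, so map will not work. Enter flatMap: a map operation followed by a flatten operation. In general, a flatten operation takes each element of a collection and, if the element is itself a collection, unwraps it, so that the parent collection no longer contains child collections but instead holds all of their elements as its own children. If that sounds confusing, there are better explanations of flattening than mine elsewhere. In Java's case, this means that our mapping operation must return a Stream, and the flattening is handled by the flatMap method.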
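Wait a minute, what is all this -> business? Glad you asked. See, flatMap is a higher-order function, that is, a function that takes another function as an argument. We could write the function as a method somewhere (to keep the terms apart, since they are very similar: a method is a function attached to an object or class), but this particular function has no logical reason to be attached to any particular object. What is more, we do not care about reusing it, so it does not even need a name. It is much easier to just specify the function inline. Enter lambda expressions! This question is not about them, though, so read up on them separately to learn more.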
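Our lambda splits each String along the specified delimiters (I converted your delimiters string into a regular expression). This returns an array, but we need a Stream, so we use the convenience method Arrays.stream to convert it easily. Now each line is turned into a stream of words, and flatMap flattens the separate lines into a single stream of all the words in the file. Although I came up with this on my own, nearly the same thing is used in the common usage examples of the API.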
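.filter: Another higher-order function. This one removes every entry in the stream for which the given function does not return true. In the sample code in your question you do not count any of the stop words in the array, so here I use filter to do the same thing, with the help of the rather handy (and self-explanatory) List.contains. (It required wrapping the array you used in a List, but I believe the conciseness you gain is worth it.) We are left with a stream that keeps only the words that are not stop words.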
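.collect, groupingBy, etc.: Finally, the good stuff. This one short line does basically all of the work your question asks for. collect is a method that makes a Stream return a single object, typically a collection object such as a list or an array, hence the name. As its argument it can take a Collector, an object that knows how to collect a given Stream into the desired object. We could build our own, but in this case that is unnecessary; the standard library has once again done the work for us. We use the existing collector groupingBy. In its most basic form, groupingBy takes a single argument (a function; again, we have a higher-order function), called the classifier, which decides how the items are grouped. For this argument we supply Function.identity() (statically imported to match the collectors, which were in turn statically imported to match the style used in the examples in their API). This function simply takes its argument and returns it, for when you need a function argument but do not actually want to modify the input (it is an alternative to the equivalent but uglier x -> x lambda). We want this because the return value of this function forms the keys of the map we are collecting, and the collector automatically groups under a common key all the items whose return values are .equals to each other (that is, all of our repeated words).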
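By default, this would leave us with a map that has the words themselves as keys and, as values, a List containing every individual instance of the given word. We do not want that, but fortunately there is a groupingBy overload that lets us specify a second argument: a collector that turns each group of values into a single object value for its key. Since each group currently contains all the instances of a word, we just need the size of each group to use as the value for each map key. Fortunately, the standard library comes through again with the summingInt collector, which sums an int representation of each item in the stream. Here we could specify a function that returns a different int for each item (for example, if we were counting total letters rather than words, the expression would be s -> s.length()), but we do not want that, so we ignore the s variable handed to us and constantly return 1 with s -> 1, ensuring that 1 is added for each instance of a word.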
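
The steps described above can be followed on a small in-memory example, with no file involved. This is only a sketch: the class name PipelineSketch, the shortened stop-word list, the simplified delimiter regex, and the sample lines are all assumptions made for illustration. It also adds two steps that the code above does not have, lowercasing each word and dropping the empty strings that split can produce when a line begins with a delimiter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PipelineSketch {
    // Abbreviated stop-word list, for illustration only
    static final List<String> STOP_WORDS = Arrays.asList("the", "a", "is");

    static Map<String, Integer> count(Stream<String> lines) {
        return lines
            // one line -> many words; flatMap merges the per-line streams into one
            .flatMap(line -> Arrays.stream(line.split("[ \t,;.?!]+")))
            // added step: normalize case so "The" and "the" are the same word
            .map(String::toLowerCase)
            // added step: drop empty tokens, then drop stop words
            .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
            // group equal words and add 1 per occurrence
            .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) {
        Stream<String> lines = Stream.of("The goal is power.", "A goal, a plan");
        // The result is a HashMap, so the printed order is not guaranteed
        System.out.println(count(lines));
    }
}
```

With these two sample lines, "goal" is counted twice and "power" and "plan" once each, while "the", "a" and "is" are filtered out.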

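TL;DR: We use built-in methods to concisely filter out the stop words, then group the remaining words into a map with the words as keys and the number of instances of each word as values, all in 2 lines.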
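
To see the difference between the one-argument and two-argument forms of groupingBy discussed above, the following sketch may help. The class name GroupingSketch, the method names group and tally, and the sample words are hypothetical. It uses Collectors.counting() as the downstream collector, a standard-library alternative to summingInt(s -> 1) that produces Long values instead of Integer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingSketch {
    // One-argument groupingBy: each key maps to the List of its occurrences.
    static Map<String, List<String>> group(List<String> words) {
        return words.stream().collect(Collectors.groupingBy(w -> w));
    }

    // Two-argument groupingBy: the downstream collector reduces each group to a count.
    static Map<String, Long> tally(List<String> words) {
        return words.stream()
                    .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("goal", "power", "goal"); // sample input
        System.out.println(group(words)); // values are lists of the repeated words
        System.out.println(tally(words)); // values are the counts
    }
}
```

Whether to use counting() or summingInt is mostly a question of which value type you want in the resulting map.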

Counting strings
