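I want to count the number of times each word is repeated in a given file, but I am having trouble with it. I have tried two different approaches. First I used a HashMap with the word as the key and its frequency as the associated value. That did not seem to work, because with a HashMap you cannot access the element at a specified index. Now I am trying two separate ArrayLists, one for the words and one for the number of occurrences of each word. My idea is: when adding a word to the wordsCount ArrayList, if the word is already in wordsCount, increment the value of the element in the cnt ArrayList at the index of the word that has already been seen. However, I am not sure what to write to increment the value.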
import java.io.*;
import java.lang.reflect.Array;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;
public class MP0 {
    Random generator;
    String delimiters = " \t,;.?!-:@[](){}_*/";
    String[] stopWordsArray = {"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
            "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its",
            "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that",
            "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
            "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
            "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
            "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
            "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
            "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
            "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"};
    private static String str;
    private static File file;
    private static Scanner s;

    public MP0() {
    }

    public void process() throws Exception {
        ArrayList<Integer> cnt = new ArrayList<Integer>();
        boolean isStopWord = false;
        StringTokenizer st = new StringTokenizer(s.nextLine(), delimiters);
        ArrayList<String> wordsCount = new ArrayList<String>();
        while (st.hasMoreTokens()) {
            String s = st.nextToken().toLowerCase();
            if (!wordsCount.contains(s)) {
                for (int i = 0; i < stopWordsArray.length; i++) {
                    isStopWord = false;
                    if (s.equals(stopWordsArray[i])) {
                        isStopWord = true;
                        break;
                    }
                }
                if (isStopWord == false) {
                    wordsCount.add(s);
                    cnt.add(1);
                }
            }
            else { // i tried this but only displayed "1" for all words
                cnt.set(wordsCount.indexOf(s), cnt.get(wordsCount.indexOf(s) + 1));
            }
        }
        for (int i = 0; i < wordsCount.size(); i++) {
            System.out.println(wordsCount.get(i) + " " + cnt.get(i));
        }
    }

    public static void main(String args[]) throws Exception {
        try {
            file = new File("input.txt");
            s = new Scanner(file);
            str = s.nextLine();
            String[] topItems;
            MP0 mp = new MP0();
            while (s.hasNext()) {
                mp.process();
                str = s.nextLine();
            }
        }
        catch (FileNotFoundException e) {
            System.out.println("File not found");
        }
    }
}
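A note on the else branch above: cnt.get(wordsCount.indexOf(s) + 1) reads the count stored at the next index rather than adding 1 to the count at the word's own index, so the + 1 belongs outside the call to get. A minimal sketch of the two-list idea with that change, leaving out stop-word filtering for brevity; the class name TwoListCount and the method addLine are hypothetical names used only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class illustrating the two-list approach from the question.
public class TwoListCount {
    // Parallel lists: words.get(i) has been seen counts.get(i) times.
    public final List<String> words = new ArrayList<>();
    public final List<Integer> counts = new ArrayList<>();

    // Tokenizes one line and updates both lists.
    public void addLine(String line, String delimiters) {
        StringTokenizer st = new StringTokenizer(line, delimiters);
        while (st.hasMoreTokens()) {
            String word = st.nextToken().toLowerCase();
            int index = words.indexOf(word);
            if (index < 0) {
                // First occurrence: append the word with a count of 1.
                words.add(word);
                counts.add(1);
            } else {
                // Seen before: read the count at the same index, then add 1.
                counts.set(index, counts.get(index) + 1);
            }
        }
    }

    public static void main(String[] args) {
        TwoListCount tc = new TwoListCount();
        tc.addLine("our goal is our power", " \t,;.?!-:@[](){}_*/");
        for (int i = 0; i < tc.words.size(); i++) {
            System.out.println(tc.words.get(i) + " " + tc.counts.get(i));
        }
    }
}
```

Note that the lists here are fields, so counts accumulate across lines; in the question's process method both lists are created anew on every call, so each line is counted separately.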
Answer 0 (score: 3)
I believe you can do what you want with a HashMap. Something like this:
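HashMap<String, Integer> mymap = new HashMap<>();
// Loop over whatever collection holds the words you want to count
for (String word : stopWordsArray) {
    if (mymap.containsKey(word)) {
        // Word seen before: increment its count
        mymap.put(word, mymap.get(word) + 1);
    } else {
        // First occurrence (autoboxing replaces the deprecated new Integer(1))
        mymap.put(word, 1);
    }
}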
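Edit: added the corrections from the comments.
Second edit: there is an Oracle tutorial on how to do this.
It is the same idea, but it looks more concise. Here is a summary of the relevant code: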
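// m is a Map<String, Integer>, such as a HashMap
for (String word : stopWordsArray) {
    Integer freq = m.get(word);
    // A null frequency means the word has not been seen yet
    m.put(word, (freq == null) ? 1 : freq + 1);
}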
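Both snippets above loop over stopWordsArray as written; to count the words of a file, the loop would run over the tokens of each line instead. A minimal sketch of that, assuming the delimiters from the question and using Map.merge (available since Java 8) in place of the explicit null check. The class name MapCount and the method countWords are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical example class; not part of the original answer.
public class MapCount {
    // Adds the lower-cased tokens of one line to the given count map.
    public static void countWords(String line, String delimiters, Map<String, Integer> counts) {
        StringTokenizer st = new StringTokenizer(line, delimiters);
        while (st.hasMoreTokens()) {
            String word = st.nextToken().toLowerCase();
            // merge stores 1 if the key is absent, otherwise adds 1 to the old value
            counts.merge(word, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        countWords("Our goal is our power", " \t,;.?!-:@[](){}_*/", counts);
        // No index is needed to read the results: iterate over the entries
        counts.forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

On the concern in the question about not being able to access a HashMap by index: printing every word and its count only needs iteration over the entries, as in the forEach call above, so an index is not required.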
Answer 1 (score: 0)
You can also use Pattern and Matcher.
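// Requires java.util.regex.Pattern and java.util.regex.Matcher
String in = "our goal is our power";
int i = 0;
Pattern p = Pattern.compile("our");
Matcher m = p.matcher(in);
// Each successful find() is one more occurrence of the pattern
while (m.find()) {
    i++;
}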
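One caveat with the snippet above: the pattern "our" also matches inside longer words such as "yours" or "hour". A hedged sketch of one way to count whole words only, using the \b word-boundary anchor and Pattern.quote so that the search word is treated literally. The class name RegexCount and the method countOccurrences are illustrative names, not part of the original answer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class for counting whole-word matches.
public class RegexCount {
    // Counts case-insensitive, whole-word occurrences of word in text.
    public static int countOccurrences(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "yours" is not counted because there is no word boundary around "our" inside it
        System.out.println(countOccurrences("Our goal is our power, not yours", "our")); // prints 2
    }
}
```

This counts one word at a time, so for a frequency table of every word in a file a Map-based approach is likely the better fit.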
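Answer 2 (score: 0)
I think a Map is definitely the way to represent the count for each word. In my opinion, the best way to get that map (or at least one that has not been mentioned yet) is to put the words into a Stream. That way you can take advantage of a large amount of code already written in the Java standard library, make your code more concise, and avoid reinventing the wheel. Streams have a bit of a learning curve, but once you understand them they are very useful. For example, watch your 20+ line method shrink to 2 lines: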
import java.util.Map;
import java.util.ArrayList;
import java.util.Arrays;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.summingInt;
import static java.util.function.Function.identity;
public class CountWords
{
    private static String delimiters = "[ \t,;.?!\\-:@\\[\\](){}_*/]+";
    private static ArrayList<String> stopWords = new ArrayList<>(Arrays.asList(new String[] {"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
            "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its",
            "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that",
            "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
            "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
            "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
            "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
            "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
            "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
            "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"}));

    public static void main(String[] args) throws IOException //Your code should likely catch this
    {
        Path fLoc = Paths.get("test.txt"); //Or get from stdio, args[0], etc...
        CountWords cw = new CountWords();
        Map<String, Integer> counts = cw.count(Files.lines(fLoc).flatMap(s -> Arrays.stream(s.split(delimiters))));
        counts.forEach((k, v) -> System.out.format("Key: %s, Val: %d\n", k, v));
    }

    public Map<String, Integer> count(Stream<String> words)
    {
        return words.filter(s -> !stopWords.contains(s))
                    .collect(groupingBy(identity(), summingInt(s -> 1)));
    }
}
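It is easy enough to look these things up in the API, but here is what may be less obvious:

Files.lines: A nifty little method that takes the path to a file and returns a Stream of all the lines in that file. What we actually want is a stream of words, which brings us to the next operation.

.flatMap: In general, a map operation takes each item in a collection and turns it into something else. Stream has a method called map that takes each item and converts it to some other item. In our case, however, we want to turn lines into words, and each line may contain many words, so map will not work. Enter flatMap: a map operation followed by a flatten operation. In general, a flatten operation takes each element of a collection and, if the element is itself a collection, unrolls it so that the parent collection no longer contains sub-collections but instead holds all of their children as its own children. If that sounds confusing, it may help to hear someone else explain it better than I do. In Java's case, this means our mapping operation must return a Stream, and the flattening is handled by the flatMap method.

What is the -> business? Glad you asked. You see, flatMap is a higher-order function, that is, a function that takes another function as an argument. We could write the function as a method somewhere (to keep these similar terms apart: a method is a function attached to an object or class), but this particular function has no logical reason to be attached to any particular object. What is more, we do not care about reusing it, so it does not even need a name. It is much easier to specify the function inline. Enter lambda expressions! This question is not really about them, though, so read up on them separately to learn more.

Inside the lambda we use the split method on String with a regular expression. This returns an array, but we need a Stream, so we use the convenience method Arrays.stream to convert it. Now each line is made into a stream of words, and flatMap flattens the separate lines into a single stream of all the words in the file. Although I came up with this on my own, almost exactly the same thing is used in the common usage examples of the API.

.filter: Another higher-order function. This one removes every entry in the stream that does not cause the given function to return true. In the question's example code, you do not count the stop words in the array, so here I do the same thing with filter, with the help of the rather handy (and self-explanatory) List.contains. (It requires boxing the array you used into a List, but I believe the brevity gained is worth it.) As a result, we have a stream that keeps only the words that are not stop words.

.collect, groupingBy, etc.: Finally, the good stuff. This one short line does essentially all the work your question asks for. collect is a way to make a Stream return a single object, typically a collection object such as a list or an array, hence the name. As an argument it can take a Collector, an object that knows how to collect the given Stream into the desired object. We could build our own, but in this case that is unnecessary; the standard library has once again done the work for us. We use the existing collector groupingBy. In its most basic form, groupingBy takes a single argument (a function; again, we have a higher-order function) called the classifier, which sorts the items into groups. For this argument we supply Function.identity() (statically imported to match the collectors, which in turn are statically imported to match the style used in the examples in the API). This function simply takes an argument and returns it, for cases where you need a function argument but do not actually want to modify the input (it is an equivalent but less ugly substitute for the x -> x lambda). We want this because the return value of this function forms the keys of the map we are collecting, and the collector automatically groups together, under a common key, all return values that are equal to each other according to equals (and all our duplicate words will be equal to each other).

An overload of groupingBy gives us a second argument to specify: a collector that turns each Stream of values into a single object value for each key. Since the current stream holds all the instances of each word, we only need to get the length of each stream and use it as the value for each map key. Fortunately, the standard library comes through again with the summingInt collector, which sums up an int representation of each item in the stream. Here we could specify a function that returns a different int for each item (for example, if we were counting total letters rather than words, the expression would be s -> s.length()), but we do not want that, so we ignore the s variable supplied to us and constantly return 1 with s -> 1, ensuring that 1 is added for every instance of each word.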
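To try the pipeline without a file, here is a small sketch that runs the same flatMap, filter, and groupingBy steps over an in-memory list of lines. The class name StreamCountDemo, the sample lines, and the shortened stop-word list are assumptions made for illustration. Unlike the code in the answer, this sketch also lower-cases each token (as the question's code does) and drops the empty strings that split can produce when a line is empty or starts with a delimiter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical demo class; names and sample data are illustrative only.
public class StreamCountDemo {
    // Same delimiter regex as in the answer above.
    private static final String DELIMITERS = "[ \t,;.?!\\-:@\\[\\](){}_*/]+";

    public static Map<String, Integer> count(List<String> lines, List<String> stopWords) {
        return lines.stream()
                // one stream of words per line, flattened into a single stream
                .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
                .map(String::toLowerCase)
                // drop empty tokens and stop words
                .filter(w -> !w.isEmpty() && !stopWords.contains(w))
                // group equal words together and add 1 for each occurrence
                .collect(Collectors.groupingBy(Function.identity(), Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Our goal is our power.", "Power to the people!");
        List<String> stopWords = Arrays.asList("our", "is", "to", "the");
        // Iteration order of the resulting map is not specified
        System.out.println(count(lines, stopWords));
    }
}
```

When reading from a real file with Files.lines, wrapping the stream in a try-with-resources statement is a common way to make sure the file is closed afterwards.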