Question

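I am working on a Java application that takes two text files along with other arguments, builds a HashMap for each file, and runs several comparison methods on them. One method prints the unique words that the two files share and then computes the Jaccard index of the two files. I would also like this method to print the number of occurrences of each of those words in each file, and I am wondering what the best way to do that is. I have looked through many other examples here but could not find an answer.

Below is part of the method I am currently using. The two HashMaps contain only the unique words, each mapped to its frequency count. The method gives me the words common to both files, but I would also like to see how often each word is used in each file.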

public double compareMaps(HashMap<String,Integer> hMap1,HashMap<String,Integer> hMap2){

    Set<String> mapSet1 = new TreeSet<>(hMap1.keySet());
    Set<String> mapSet2 = new TreeSet<>(hMap2.keySet());

    Set<String> Intersect = new TreeSet<>(mapSet1);
    Intersect.retainAll(mapSet2);


    Set<String> union = new TreeSet<>(mapSet1);

    union.addAll(mapSet2);

    Iterator iterator;
    iterator = Intersect.iterator();
    System.out.printf("%nUnique words in Document 1: %d%nUnique words in Document 2: %d%n", hMap1.size(), hMap2.size());

    System.out.println("Word\t\tCount1\t\tCount2");
    while (iterator.hasNext()){
        System.out.println(iterator.next());

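My current output:
Unique words in Document 1: 91
Unique words in Document 2: 122
Word		Count1		Count2
a
also
an
and

What I want:
Unique words in Document 1: 91
Unique words in Document 2: 122
Word		Count1		Count2
a		4		7
also		3		3
an		5		4
and		3		6

Thanks in advance for your help!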

Answer 1

Your counts are in the original maps that were passed in, so you need to look them up there:

while (iterator.hasNext()) {
  // the question declares a raw Iterator, so cast here (or declare it as Iterator<String>)
  String word = (String) iterator.next();
  System.out.println(word + "\t" + hMap1.get(word) + "\t" + hMap2.get(word));
}
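
For reference, here is a minimal self-contained sketch of the same idea, including the Jaccard index mentioned in the question. The class name SharedWordReport, the method names, and the sample counts are made up for illustration, and returning 0 for two empty documents is an assumption rather than anything stated in the question.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class; names are illustrative, not from the question.
class SharedWordReport {

    // Jaccard index = |intersection| / |union| of the two key sets.
    static double jaccard(Map<String, Integer> m1, Map<String, Integer> m2) {
        Set<String> union = new TreeSet<>(m1.keySet());
        union.addAll(m2.keySet());
        if (union.isEmpty()) {
            return 0.0; // assumption: two empty documents score 0
        }
        Set<String> shared = new TreeSet<>(m1.keySet());
        shared.retainAll(m2.keySet());
        return (double) shared.size() / union.size();
    }

    // One tab-separated line per shared word: word, count in doc 1, count in doc 2.
    static String sharedWordTable(Map<String, Integer> m1, Map<String, Integer> m2) {
        Set<String> shared = new TreeSet<>(m1.keySet()); // sorted copy, maps stay untouched
        shared.retainAll(m2.keySet());
        StringBuilder sb = new StringBuilder("Word\tCount1\tCount2\n");
        for (String word : shared) {
            // the counts live in the original maps, so look them up there
            sb.append(word).append('\t')
              .append(m1.get(word)).append('\t')
              .append(m2.get(word)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = new HashMap<>();
        doc1.put("a", 4);
        doc1.put("also", 3);
        doc1.put("cat", 1);

        Map<String, Integer> doc2 = new HashMap<>();
        doc2.put("a", 7);
        doc2.put("also", 3);
        doc2.put("dog", 2);

        System.out.print(sharedWordTable(doc1, doc2));
        System.out.printf("Jaccard index: %.2f%n", jaccard(doc1, doc2));
    }
}
```

Iterating the intersection with a for-each loop over a Set<String> avoids the raw Iterator and the cast altogether.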

Answer 2

To get the occurrences of each word from a file, you can use the following code:

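import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// split pattern: breaks sentences into words
static final Pattern SPLIT_PATTERN = Pattern.compile("[- .:,]+");

// read the file with a BufferedReader
BufferedReader reader = Files.newBufferedReader(
            Paths.get("<add_here_the_filename>"), StandardCharsets.UTF_8);

// solution one - using groupingBy: group words by first letter, then by length
Map<String, Map<Integer, List<String>>> solution_1 =
        reader.lines()
              .flatMap(line -> SPLIT_PATTERN.splitAsStream(line))
              .filter(word -> !word.isEmpty()) // a line starting with a delimiter yields an empty token
              .collect(Collectors.groupingBy(word -> word.substring(0,1),
                       Collectors.groupingBy(String::length)));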

Alternatively, you can use toMap() to build a map from each word to its number of occurrences.
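
A rough sketch of what the toMap() variant could look like is below. The class name WordCounter, the method countWords, and the sample lines are hypothetical; the split pattern is the one from the snippet above, and collecting into a TreeMap is just a choice made here so that the words come out sorted.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class; names are illustrative only.
class WordCounter {

    // Same delimiter set as in the answer: hyphen, space, period, colon, comma.
    static final Pattern SPLIT_PATTERN = Pattern.compile("[- .:,]+");

    // Counts how often each word occurs across all lines.
    static Map<String, Integer> countWords(Stream<String> lines) {
        return lines
                .flatMap(SPLIT_PATTERN::splitAsStream)
                .filter(word -> !word.isEmpty())   // skip empty tokens from leading delimiters
                .collect(Collectors.toMap(
                        word -> word,              // key: the word itself
                        word -> 1,                 // value: one occurrence
                        Integer::sum,              // merge: add counts for repeated words
                        TreeMap::new));            // sorted map, chosen for readable output
    }

    public static void main(String[] args) {
        // In a real program the stream could come from reader.lines() on a file.
        Map<String, Integer> counts =
                countWords(Stream.of("a cat and a dog", "a dog: also a cat"));
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

The three-argument and four-argument forms of toMap() take a merge function, which is what turns duplicate keys into summed counts instead of an IllegalStateException.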

Printing from a Set and a HashMap
