Question

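Problem: merge 1000 hash maps into a single map. Suppose each HashMap holds the letters and their frequencies for one page of a book, and the book has 1000 pages. We have scanned every page and built 1000 hash maps, and now we want to reduce/merge them. This must be done with multithreading. Note: we are not using Hadoop, because this has to run on a single machine. The question is tailored to clear up my own doubts, so please refrain from answers that suggest approaches not based on threads.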

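Is this a typical problem with a known solution? If so, please point me to any reference links.
If not, how should the reduce/merge be approached, given that threads do not return values? Here is one proposed approach: work in a divide-and-conquer fashion. First spawn 500 threads, each combining 2 maps, then spawn 250 threads, each combining 2 of the merged maps, and so on. Any objections? Better ideas?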

Answer 1

If you can use Java 8, you can use a parallel stream to do the work in parallel:

// needs: import static java.util.stream.Collectors.toMap; import java.util.Map.Entry;
List<Map<String, Integer>> maps = new ArrayList<>();
// populate: one map per page

Map<String, Integer> summary = maps.parallelStream()
        .flatMap(m -> m.entrySet().stream())
        .collect(toMap(Entry::getKey, Entry::getValue, (i1, i2) -> i1 + i2));

With Java < 8 you need to parallelize it yourself, for example with the Fork/Join framework (which is what parallelStream uses behind the scenes) or an ExecutorService.

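In any case, for a CPU-bound task, spawning more threads than the machine has processors is counterproductive, so don't start 500 threads unless you are running a 500-core beast.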
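To make the ExecutorService route a bit more concrete, here is a minimal sketch. It is only one way to do it: the class name `ChunkedMerge` and the methods `mergeSlice` and `mergeAll` are made up for illustration, and the "one chunk per thread" split is my own assumption rather than anything from the question. It also addresses the "threads do not return values" concern, since a `Callable` submitted to the pool hands its partial map back through a `Future`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper class: splits the maps into one chunk per thread.
class ChunkedMerge {

  // Sequentially sums a slice of maps into a fresh map (inputs are not modified).
  static Map<String, Integer> mergeSlice(List<Map<String, Integer>> slice) {
    Map<String, Integer> result = new HashMap<>();
    for (Map<String, Integer> m : slice) {
      for (Map.Entry<String, Integer> e : m.entrySet()) {
        Integer old = result.get(e.getKey());
        result.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
      }
    }
    return result;
  }

  // Merges all maps using a fixed pool of `threads` workers (threads must be >= 1).
  static Map<String, Integer> mergeAll(List<Map<String, Integer>> maps, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Ceiling division so that at most `threads` chunks are created.
      int chunk = Math.max(1, (maps.size() + threads - 1) / threads);
      List<Future<Map<String, Integer>>> partials = new ArrayList<>();
      for (int from = 0; from < maps.size(); from += chunk) {
        final List<Map<String, Integer>> slice =
            maps.subList(from, Math.min(from + chunk, maps.size()));
        partials.add(pool.submit(new Callable<Map<String, Integer>>() {
          @Override
          public Map<String, Integer> call() {
            return mergeSlice(slice);
          }
        }));
      }
      // Collect the partial maps and combine them on the calling thread.
      List<Map<String, Integer>> done = new ArrayList<>();
      for (Future<Map<String, Integer>> f : partials) {
        done.add(f.get());
      }
      return mergeSlice(done);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<Map<String, Integer>> maps = new ArrayList<>();
    for (int page = 0; page < 1000; page++) {
      Map<String, Integer> m = new HashMap<>();
      m.put("a", 2);
      m.put("b", 1);
      maps.add(m);
    }
    // One thread per processor, in line with the advice above.
    int threads = Runtime.getRuntime().availableProcessors();
    System.out.println(mergeAll(maps, threads));
  }
}
```

The final combine step runs on a single thread, which should be cheap here because there are only as many partial maps as there are threads.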

Full example:

import static java.util.stream.Collectors.toMap;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

// The class name is arbitrary; any name works.
public class MergeExample {

  public static void main(String[] args) {
    List<Map<String, Integer>> maps = new ArrayList<>();

    maps.add(map("a cat and a dog and a cat and a dog"));
    maps.add(map("a hat and a man and a man and a cat"));
    maps.add(map("a cat and a dog and a cat and a dog"));
    maps.add(map("a hat and a man and a man and a cat"));
    maps.add(map("a cat and a dog and a cat and a dog"));
    maps.add(map("a hat and a man and a man and a cat"));

    System.out.println(maps);

    Map<String, Integer> summary = maps.parallelStream()
                .flatMap(m -> m.entrySet().stream())
                //what thread are we on?
                .peek(e -> System.out.println(Thread.currentThread()))
                .collect(toMap(Entry::getKey, Entry::getValue, (i1, i2) -> i1 + i2));

    System.out.println("summary = " + summary);
  }

  // Builds a word-frequency map for one "page" of text.
  private static Map<String, Integer> map(String text) {
    Map<String, Integer> map = new HashMap<>();
    for (String s : text.split("\\s+")) {
      Integer count = map.getOrDefault(s, 0) + 1;
      map.put(s, count);
    }
    return map;
  }
}

Answer 2

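Following the same divide-and-conquer strategy that merge sort is based on, you can do the following:

Merging 1000 HMs can be split into:

A: merge the first 500;
B: merge the second 500;
merge A and B;

A and B can each be split into sub-parts of 250 HMs to merge, and so on.

Now that we have the basic idea of parallel merging, let's make a few adjustments:

Below a certain number of HMs, say 8, there is no point in merging in parallel; you should do it in a single thread.
You can use a ThreadPoolExecutor to distribute the merge tasks across a fixed number of threads.

That's it!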
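A rough sketch of the recursive split with a sequential cutoff might look like the following. Note that I have swapped in a ForkJoinPool with a RecursiveTask instead of a plain ThreadPoolExecutor: with a fixed pool, parent tasks that block waiting for their sub-tasks can occupy every thread, whereas Fork/Join is designed for this kind of recursion. The class name `MergeTask` and the `THRESHOLD` constant (set to the 8 suggested above) are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical task: merges a list of maps by recursive halving.
class MergeTask extends RecursiveTask<Map<String, Integer>> {
  // At or below this many maps, merge sequentially in the current thread.
  static final int THRESHOLD = 8;

  private final List<Map<String, Integer>> maps;

  MergeTask(List<Map<String, Integer>> maps) {
    this.maps = maps;
  }

  @Override
  protected Map<String, Integer> compute() {
    if (maps.size() <= THRESHOLD) {
      // Small enough: sum into a fresh map so the inputs stay untouched.
      Map<String, Integer> result = new HashMap<>();
      for (Map<String, Integer> m : maps) {
        addAll(result, m);
      }
      return result;
    }
    int mid = maps.size() / 2;
    MergeTask a = new MergeTask(maps.subList(0, mid));
    MergeTask b = new MergeTask(maps.subList(mid, maps.size()));
    a.fork();                                   // run A asynchronously
    Map<String, Integer> right = b.compute();   // run B in this thread
    Map<String, Integer> left = a.join();       // wait for A
    addAll(left, right);                        // merge A and B
    return left;
  }

  // Adds every count from source into target.
  private static void addAll(Map<String, Integer> target, Map<String, Integer> source) {
    for (Map.Entry<String, Integer> e : source.entrySet()) {
      target.merge(e.getKey(), e.getValue(), Integer::sum);
    }
  }

  public static void main(String[] args) {
    List<Map<String, Integer>> pages = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      Map<String, Integer> page = new HashMap<>();
      page.put("a", 3);
      page.put("e", 5);
      pages.add(page);
    }
    // Default constructor sizes the pool to the number of available processors.
    ForkJoinPool pool = new ForkJoinPool();
    try {
      System.out.println(pool.invoke(new MergeTask(pages)));
    } finally {
      pool.shutdown();
    }
  }
}
```

The best cutoff depends on how large the individual maps are, so 8 is only a starting point to measure against.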

Answer 3

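I suggest using a dedicated queue of the maps to be merged. Start any number of worker threads, each doing the following:

Poll the queue and take the first map. All subsequent maps will be merged into this one.
Poll the queue and take the next map. If the poll returns a map, merge it into the first map. If the queue is empty, put the first map (which has grown by now) back into the queue, as an operation atomic with the poll, and exit. Maintain an integer that reflects the number of remaining threads. If this is the last thread, all the work is done: instead of putting the map back, pass the resulting map on to the next processing step.

For a complete code sample, see https://github.com/rfqu/CodeSamples/blob/master/src/reduceMaps/ReduceMaps.java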
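For readers who just want the shape of the worker loop, here is a compact sketch of the steps described above. This is not the code from the linked repository; the class `QueueReducer` and its methods `nextOrRetire`, `work` and `reduce` are names I made up, and using a single `synchronized` method to make "poll, or put back and exit" atomic is my own assumption about how to implement that step.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical single-use reducer: create it, call reduce() once.
class QueueReducer {
  private final Queue<Map<String, Integer>> queue = new ArrayDeque<>();
  private int activeWorkers;              // number of workers that have not exited yet
  private Map<String, Integer> result;    // set by the last worker to exit

  QueueReducer(Collection<Map<String, Integer>> maps) {
    // Copy each map so the workers never mutate the caller's data.
    for (Map<String, Integer> m : maps) {
      queue.add(new HashMap<>(m));
    }
  }

  // Atomically: return the next map, or, if the queue is empty, retire this worker.
  // A retiring worker puts its accumulated map back unless it is the last one,
  // in which case the accumulated map becomes the final result.
  private synchronized Map<String, Integer> nextOrRetire(Map<String, Integer> acc) {
    Map<String, Integer> next = queue.poll();
    if (next != null) {
      return next;
    }
    activeWorkers--;
    if (activeWorkers == 0) {
      result = (acc != null) ? acc : new HashMap<String, Integer>();
    } else if (acc != null) {
      queue.add(acc);
    }
    return null;
  }

  private void work() {
    Map<String, Integer> acc = nextOrRetire(null);   // the "first map"
    if (acc == null) {
      return;                                        // nothing left to do
    }
    Map<String, Integer> next;
    while ((next = nextOrRetire(acc)) != null) {
      // Merging happens outside the lock, so workers run in parallel.
      for (Map.Entry<String, Integer> e : next.entrySet()) {
        acc.merge(e.getKey(), e.getValue(), Integer::sum);
      }
    }
  }

  Map<String, Integer> reduce(int workers) throws InterruptedException {
    if (workers < 1) {
      throw new IllegalArgumentException("need at least one worker");
    }
    synchronized (this) {
      activeWorkers = workers;
    }
    List<Thread> threads = new ArrayList<>();
    for (int i = 0; i < workers; i++) {
      Thread t = new Thread(this::work);
      threads.add(t);
      t.start();
    }
    for (Thread t : threads) {
      t.join();
    }
    synchronized (this) {
      return result;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    List<Map<String, Integer>> pages = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      Map<String, Integer> page = new HashMap<>();
      page.put("a", 1);
      page.put("z", 2);
      pages.add(page);
    }
    int workers = Runtime.getRuntime().availableProcessors();
    System.out.println(new QueueReducer(pages).reduce(workers));
  }
}
```

Compared with a fixed tree of merges, this keeps every worker busy until the queue runs dry, at the price of one short lock acquisition per map.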

Multithreaded merge: recommended best practices for Java threads
