Question

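I am trying to filter/reduce a stream of data that contains some duplicated entries.

In essence, I am trying to find a better way of filtering a data set than the one I have implemented. At its core, our data looks something like this: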

Action | Date         | Detail
15     | 2016-03-15   | 
5      | 2016-03-15   | D1
5      | 2016-09-25   | D2      <--
5      | 2016-09-25   | D3      <-- same day, different detail
4      | 2017-02-08   | D4
4      | 2017-02-08   | D5
5      | 2017-03-01   | D6      <--
5      | 2017-03-05   | D6      <-- different day, same detail; need earliest
5      | 2017-03-08   | D7
5      | 2017-03-10   | D8
...

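I need to extract the following:

Only action 5 is selected
If the detail is the same (e.g., D6 appears twice on different days), the earliest date is selected

This data is loaded into objects (one instance per "record"), and the objects have other fields, but they are not relevant to this filtering. The detail is stored as a String, the date as a ZonedDateTime, and the action as an int (actually an enum, but shown here as an int). The objects are provided in a List<Entry> in chronological order.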
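
For concreteness, the record type described above might look roughly like the following. This is only a minimal sketch: it shows just the three relevant fields, the accessor names are assumptions (the snippets in this thread use both getType() and getAction() for the action), and the of(...) factory is a hypothetical convenience for building test data.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical minimal version of the Entry class; the real one has more fields.
final class Entry {
    private final int action;          // an enum in the real code, shown as int here
    private final ZonedDateTime date;
    private final String detail;

    Entry(int action, ZonedDateTime date, String detail) {
        this.action = action;
        this.date = date;
        this.detail = detail;
    }

    int getAction() { return action; }
    ZonedDateTime getDate() { return date; }
    String getDetail() { return detail; }

    // Convenience factory (an assumption of this sketch): midnight UTC on the given day.
    static Entry of(int action, int year, int month, int day, String detail) {
        ZonedDateTime date = ZonedDateTime.of(year, month, day, 0, 0, 0, 0, ZoneOffset.UTC);
        return new Entry(action, date, detail);
    }

    @Override
    public String toString() {
        return "Entry [action=" + action + ", date=" + date + ", detail=" + detail + "]";
    }
}
```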

I was able to get this working, but I do not think it is an optimal solution:

  List<Entry> entries = getEntries(); // retrieved from a server

  final Set<String> update = new HashSet<>();
  List<Entry> updates =
  entries.stream()
    .filter(e -> e.getType() == 5)
    .filter(e -> pass(e, update))
    .collect(Collectors.toList());


private boolean pass(Entry ehe, Set<String> update)
   {
     final String val =  ehe.getDetail();
     if (update.contains(val)) { return false; }
     update.add(val);
     return true;
   }

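The problem is that I have to use this pass() method and check a Set<String> in it to track whether a given detail has already been processed. While this approach works, it seems it should be possible to avoid the external reference.

I tried using groupingBy on the detail, which would allow extracting the earliest entry from each list. The problem is that I then no longer have the date ordering, and I would have to process the resulting Map<String,List<Entry>>.

It seems that some reduce operation (if I am using that term correctly) without the pass() method should be possible here, but I am struggling to come up with a better implementation.

What would be a better approach, so that pass() can be removed?

Thanks!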

Answer 1

There are two solutions in this answer; the second one is noticeably faster.

Solution 1

An adaptation of Ole V.V.'s answer to another question:

Collection<Entry> result = 
 entries.stream().filter(e -> e.getAction() == 5)
  .collect(Collectors.groupingBy(Entry::getDetail, Collectors.collectingAndThen(Collectors.minBy(Comparator.comparing(Entry::getDate)), Optional::get)))
  .values();

With your sample data set (I chose GMT+0 as the time zone):

Entry [action=5, date=2017-03-01T00:00Z[GMT], detail=D6]
Entry [action=5, date=2017-03-08T00:00Z[GMT], detail=D7]
Entry [action=5, date=2017-03-10T00:00Z[GMT], detail=D8]
Entry [action=5, date=2016-03-15T00:00Z[GMT], detail=D1]
Entry [action=5, date=2016-09-25T00:00Z[GMT], detail=D2]
Entry [action=5, date=2016-09-25T00:00Z[GMT], detail=D3]

If you insist on getting a List back:

List<Entry> result = new ArrayList<>(entries.stream() ..... .values());

If you want to keep the original order, use the 3-parameter groupingBy:

...groupingBy(Entry::getDetail, LinkedHashMap::new, Collectors.collectingAndThen(...))
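
Put together, Solution 1 with the LinkedHashMap supplier could look like the sketch below. The nested Entry class is only a stand-in for the question's class (its constructor taking an ISO date string is an assumption made to keep the sample data short), and the names GroupingByEarliest, earliestPerDetail, and sample are hypothetical.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class GroupingByEarliest {
    // Stand-in for the question's Entry class; the accessor names are assumptions.
    static final class Entry {
        private final int action;
        private final ZonedDateTime date;
        private final String detail;

        Entry(int action, String isoDay, String detail) {
            this.action = action;
            this.date = LocalDate.parse(isoDay).atStartOfDay(ZoneOffset.UTC);
            this.detail = detail;
        }

        int getAction() { return action; }
        ZonedDateTime getDate() { return date; }
        String getDetail() { return detail; }
    }

    // Keeps, per detail, the action-5 entry with the earliest date.
    // LinkedHashMap keeps the groups in the order their detail was first seen.
    static List<Entry> earliestPerDetail(List<Entry> entries) {
        Comparator<Entry> byDate = Comparator.comparing(Entry::getDate);
        return new ArrayList<>(entries.stream()
                .filter(e -> e.getAction() == 5)
                .collect(Collectors.groupingBy(
                        Entry::getDetail,
                        LinkedHashMap::new,
                        Collectors.collectingAndThen(Collectors.minBy(byDate), Optional::get)))
                .values());
    }

    // The sample rows from the question.
    static List<Entry> sample() {
        return Arrays.asList(
                new Entry(15, "2016-03-15", ""),
                new Entry(5, "2016-03-15", "D1"),
                new Entry(5, "2016-09-25", "D2"),
                new Entry(5, "2016-09-25", "D3"),
                new Entry(4, "2017-02-08", "D4"),
                new Entry(4, "2017-02-08", "D5"),
                new Entry(5, "2017-03-01", "D6"),
                new Entry(5, "2017-03-05", "D6"),
                new Entry(5, "2017-03-08", "D7"),
                new Entry(5, "2017-03-10", "D8"));
    }

    public static void main(String[] args) {
        for (Entry e : earliestPerDetail(sample())) {
            System.out.println(e.getDetail() + " " + e.getDate().toLocalDate());
        }
    }
}
```

With the sample rows this yields D1, D2, D3, D6 (dated 2017-03-01), D7, D8, in that order.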

Solution 2

Using toMap, which is easier to read and faster (see holi-java's comment on this answer, and the next section):

List<Entry> col = new ArrayList<>(
  entries.stream().filter(e -> e.getAction() == 5)
  .collect(Collectors.toMap(Entry::getDetail, Function.identity(), (a,b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a))
  .values());

where (a,b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a can be replaced with:

BinaryOperator.minBy(Comparator.comparing(Entry::getDate))

If you want to keep the original order with this solution, use the 4-parameter toMap:

...toMap(Entry::getDetail, Function.identity(), (a,b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a, LinkedHashMap::new)
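
As a complete example, the toMap variant with BinaryOperator.minBy and LinkedHashMap might be written as follows. This is a sketch under assumptions: the nested Entry class is a stand-in for the question's class, and ToMapEarliest, earliestPerDetail, and sample are hypothetical names.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

class ToMapEarliest {
    // Stand-in for the question's Entry class; the accessor names are assumptions.
    static final class Entry {
        private final int action;
        private final ZonedDateTime date;
        private final String detail;

        Entry(int action, String isoDay, String detail) {
            this.action = action;
            this.date = LocalDate.parse(isoDay).atStartOfDay(ZoneOffset.UTC);
            this.detail = detail;
        }

        int getAction() { return action; }
        ZonedDateTime getDate() { return date; }
        String getDetail() { return detail; }
    }

    // One map entry per detail; when a detail repeats, the merge function
    // keeps the entry with the earlier date.
    static List<Entry> earliestPerDetail(List<Entry> entries) {
        Comparator<Entry> byDate = Comparator.comparing(Entry::getDate);
        return new ArrayList<>(entries.stream()
                .filter(e -> e.getAction() == 5)
                .collect(Collectors.toMap(
                        Entry::getDetail,
                        Function.identity(),
                        BinaryOperator.minBy(byDate),
                        LinkedHashMap::new))
                .values());
    }

    // A shortened version of the question's sample rows.
    static List<Entry> sample() {
        return Arrays.asList(
                new Entry(5, "2016-03-15", "D1"),
                new Entry(4, "2017-02-08", "D4"),
                new Entry(5, "2017-03-01", "D6"),
                new Entry(5, "2017-03-05", "D6"),
                new Entry(5, "2017-03-08", "D7"));
    }

    public static void main(String[] args) {
        for (Entry e : earliestPerDetail(sample())) {
            System.out.println(e.getDetail() + " " + e.getDate().toLocalDate());
        }
    }
}
```

One small difference between the two merge functions: when two dates are equal, BinaryOperator.minBy keeps the first of the two entries, whereas the lambda with >= 0 keeps the second.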

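Performance

With the test data I created to test my solutions, I checked the running time of both solutions. The first solution takes 67 ms on average (only 20 runs, so do not trust the numbers!), the second takes 2 ms on average. If anyone wants to do a proper performance comparison, please put the results in a comment and I will add them here.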
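
For anyone who wants to repeat the measurement, a rough timing harness could look like the sketch below. It is not a proper benchmark: it uses System.nanoTime with a simple warm-up loop, the synthetic data (generate) and all names are made up for illustration, and a harness such as JMH would give more trustworthy numbers. With so few runs, JIT warm-up can dominate the averages.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

class TimingSketch {
    // Stand-in for the question's Entry class; the accessor names are assumptions.
    static final class Entry {
        private final int action;
        private final ZonedDateTime date;
        private final String detail;

        Entry(int action, ZonedDateTime date, String detail) {
            this.action = action;
            this.date = date;
            this.detail = detail;
        }

        int getAction() { return action; }
        ZonedDateTime getDate() { return date; }
        String getDetail() { return detail; }
    }

    static final Comparator<Entry> BY_DATE = Comparator.comparing(Entry::getDate);

    // Synthetic data: n action-5 entries, one minute apart, cycling through the details.
    static List<Entry> generate(int n, int distinctDetails) {
        ZonedDateTime start = ZonedDateTime.of(2016, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC);
        List<Entry> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(new Entry(5, start.plusMinutes(i), "D" + (i % distinctDetails)));
        }
        return out;
    }

    // Solution 1: groupingBy + minBy.
    static Collection<Entry> viaGroupingBy(List<Entry> entries) {
        return entries.stream()
                .filter(e -> e.getAction() == 5)
                .collect(Collectors.groupingBy(Entry::getDetail,
                        Collectors.collectingAndThen(Collectors.minBy(BY_DATE), Optional::get)))
                .values();
    }

    // Solution 2: toMap with a merge function.
    static Collection<Entry> viaToMap(List<Entry> entries) {
        return entries.stream()
                .filter(e -> e.getAction() == 5)
                .collect(Collectors.toMap(Entry::getDetail, Function.identity(),
                        BinaryOperator.minBy(BY_DATE)))
                .values();
    }

    // Average wall-clock time per call, in nanoseconds.
    static long averageNanos(Function<List<Entry>, Collection<Entry>> f, List<Entry> data, int runs) {
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            Collection<Entry> result = f.apply(data);
            total += System.nanoTime() - t0;
            if (result.isEmpty()) throw new IllegalStateException("unexpected empty result");
        }
        return total / runs;
    }

    public static void main(String[] args) {
        List<Entry> data = generate(100_000, 1_000);
        averageNanos(TimingSketch::viaGroupingBy, data, 50);   // warm-up, result discarded
        averageNanos(TimingSketch::viaToMap, data, 50);        // warm-up, result discarded
        long grouping = averageNanos(TimingSketch::viaGroupingBy, data, 200);
        long toMap = averageNanos(TimingSketch::viaToMap, data, 200);
        System.out.printf("groupingBy: %d us, toMap: %d us%n", grouping / 1_000, toMap / 1_000);
    }
}
```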

Answer 2

If I understood correctly...

 List<Entry> result = list.stream()
            // filter first, so that only action-5 entries compete in the merge
            .filter(e -> e.getAction() == 5)
            .collect(Collectors.toMap(
            Entry::getDetail,
            Function.identity(),
            // on a duplicate detail, keep the entry with the earlier date
            (left, right) -> {
                return left.getDate().compareTo(right.getDate()) > 0 ? right : left;
            }, LinkedHashMap::new))
            .values()
            .stream()
            .collect(Collectors.toList());
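
Where the action filter sits relative to the merge matters when the same detail can occur under different actions. The sketch below compares filtering before the merge with filtering after it. The two-row data set is made up for illustration (the question's sample has no such case), and the nested Entry class and all names are hypothetical stand-ins.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class FilterOrderSketch {
    // Stand-in for the question's Entry class; the accessor names are assumptions.
    static final class Entry {
        private final int action;
        private final ZonedDateTime date;
        private final String detail;

        Entry(int action, String isoDay, String detail) {
            this.action = action;
            this.date = LocalDate.parse(isoDay).atStartOfDay(ZoneOffset.UTC);
            this.detail = detail;
        }

        int getAction() { return action; }
        ZonedDateTime getDate() { return date; }
        String getDetail() { return detail; }
    }

    // Merge rule: on a duplicate detail keep the entry with the earlier date.
    static Entry earlier(Entry left, Entry right) {
        return left.getDate().compareTo(right.getDate()) > 0 ? right : left;
    }

    // Only action-5 entries take part in the merge.
    static List<Entry> filterThenMerge(List<Entry> list) {
        return new ArrayList<>(list.stream()
                .filter(e -> e.getAction() == 5)
                .collect(Collectors.toMap(Entry::getDetail, Function.identity(),
                        FilterOrderSketch::earlier, LinkedHashMap::new))
                .values());
    }

    // All entries take part in the merge; an earlier non-5 entry can win
    // and is then thrown away by the filter.
    static List<Entry> mergeThenFilter(List<Entry> list) {
        return list.stream()
                .collect(Collectors.toMap(Entry::getDetail, Function.identity(),
                        FilterOrderSketch::earlier, LinkedHashMap::new))
                .values().stream()
                .filter(e -> e.getAction() == 5)
                .collect(Collectors.toList());
    }

    // Made-up data: the same detail under action 4 first, then under action 5.
    static List<Entry> sample() {
        return Arrays.asList(
                new Entry(4, "2017-03-01", "D6"),
                new Entry(5, "2017-03-05", "D6"));
    }

    public static void main(String[] args) {
        System.out.println("filter then merge: " + filterThenMerge(sample()).size()); // 1
        System.out.println("merge then filter: " + mergeThenFilter(sample()).size()); // 0
    }
}
```

For data like the question's sample, where each detail belongs to a single action, both orders give the same result.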

Answer 3

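The Stream interface provides the distinct method for this. It eliminates duplicates based on equals().

So one option is to implement the equals method* of Entry accordingly; another is to define a Wrapper class that checks equality based on a specific criterion (i.e. getDetail()):

class Wrapper {
    final Entry entity;
    Wrapper(Entry entity) { this.entity = entity; }
    Entry getEntity() { return this.entity; }
    // two wrappers are equal when the wrapped entries have the same detail
    public boolean equals(Object o) {
        if (o instanceof Wrapper) {
            return entity.getDetail().equals(((Wrapper) o).getEntity().getDetail());
        }
        return false;
    }
    public int hashCode() {
        return entity != null ? entity.getDetail().hashCode() : 0;
    }
}

Then wrap, apply distinct, and unwrap your entities:

entries.stream()
    .map(Wrapper::new)
    .distinct()
    .map(Wrapper::getEntity)
    .collect(Collectors.toList());

If the stream is ordered, the first occurrence is always the one that is kept. A stream over a List is always ordered.

*) I first tried it without implementing hashCode(), but that failed. The reason is that the internals of java.util.stream.DistinctOps use a HashSet to keep track of already processed elements and check it with contains, which relies on hashCode as well as on the equals method. So implementing equals alone is not sufficient.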
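
A self-contained version of the wrap, distinct, unwrap idea is sketched below. The wrapper is named DetailKey here (a hypothetical name), the nested Entry class is a stand-in for the question's class, and the action-5 filter is added so that the result matches the question's requirements. It relies on the list being in chronological order, since distinct() keeps the first occurrence in an ordered stream.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

class DistinctByDetail {
    // Stand-in for the question's Entry class; the accessor names are assumptions.
    static final class Entry {
        private final int action;
        private final ZonedDateTime date;
        private final String detail;

        Entry(int action, String isoDay, String detail) {
            this.action = action;
            this.date = LocalDate.parse(isoDay).atStartOfDay(ZoneOffset.UTC);
            this.detail = detail;
        }

        int getAction() { return action; }
        ZonedDateTime getDate() { return date; }
        String getDetail() { return detail; }
    }

    // Wrapper whose equality is defined by the detail only.
    // Both equals and hashCode are needed, because distinct() is hash-based.
    static final class DetailKey {
        private final Entry entry;

        DetailKey(Entry entry) { this.entry = entry; }

        Entry getEntry() { return entry; }

        @Override
        public boolean equals(Object o) {
            return o instanceof DetailKey
                    && Objects.equals(entry.getDetail(), ((DetailKey) o).entry.getDetail());
        }

        @Override
        public int hashCode() { return Objects.hashCode(entry.getDetail()); }
    }

    // Assumes the input is already sorted by date, so "first seen" means "earliest".
    static List<Entry> firstPerDetail(List<Entry> entries) {
        return entries.stream()
                .filter(e -> e.getAction() == 5)
                .map(DetailKey::new)
                .distinct()
                .map(DetailKey::getEntry)
                .collect(Collectors.toList());
    }

    // A shortened version of the question's sample rows.
    static List<Entry> sample() {
        return Arrays.asList(
                new Entry(5, "2016-03-15", "D1"),
                new Entry(5, "2017-03-01", "D6"),
                new Entry(5, "2017-03-05", "D6"),
                new Entry(5, "2017-03-08", "D7"));
    }

    public static void main(String[] args) {
        for (Entry e : firstPerDetail(sample())) {
            System.out.println(e.getDetail() + " " + e.getDate().toLocalDate());
        }
    }
}
```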

Answer 4

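You can use groupingBy to create a LinkedHashMap, which, unlike HashMap, preserves insertion order. You said the list is already in chronological order, so preserving the order is enough. After that, reducing each group in this map to a single value is straightforward. For example (with static imports added):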

List<Entry> selected = objs.stream()
        .filter(e -> e.getType() == 5)
        .collect(groupingBy(Entry::getDetail, LinkedHashMap::new, reducing((a, b) -> a)))
        .values().stream()
        .filter(Optional::isPresent)
        .map(Optional::get)
        .collect(toList());

The reducing part keeps the first of the one or more occurrences. See the documentation for LinkedHashMap and for the specific three-argument groupingBy I am using.
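
If the chronological order of the input cannot be relied on, one possible safeguard is to sort by date before grouping, so that "first seen" again means "earliest". The sketch below shows this variation. The sorted(...) step is an addition that is not part of the pipeline above, the nested Entry class is a stand-in, and all names are hypothetical.

```java
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.toList;

import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class FirstPerGroup {
    // Stand-in for the question's Entry class; the accessor names are assumptions.
    static final class Entry {
        private final int action;
        private final ZonedDateTime date;
        private final String detail;

        Entry(int action, String isoDay, String detail) {
            this.action = action;
            this.date = LocalDate.parse(isoDay).atStartOfDay(ZoneOffset.UTC);
            this.detail = detail;
        }

        int getAction() { return action; }
        ZonedDateTime getDate() { return date; }
        String getDetail() { return detail; }
    }

    static List<Entry> earliestPerDetail(List<Entry> entries) {
        Comparator<Entry> byDate = Comparator.comparing(Entry::getDate);
        return entries.stream()
                .filter(e -> e.getAction() == 5)
                // redundant when the input is already chronological
                .sorted(byDate)
                .collect(groupingBy(Entry::getDetail, LinkedHashMap::new,
                        // keep the first entry of each group
                        Collectors.<Entry>reducing((first, later) -> first)))
                .values().stream()
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(toList());
    }

    // Made-up, deliberately unsorted data.
    static List<Entry> unsortedSample() {
        return Arrays.asList(
                new Entry(5, "2017-03-05", "D6"),
                new Entry(5, "2017-03-01", "D6"),
                new Entry(5, "2016-03-15", "D1"));
    }

    public static void main(String[] args) {
        for (Entry e : earliestPerDetail(unsortedSample())) {
            System.out.println(e.getDetail() + " " + e.getDate().toLocalDate());
        }
    }
}
```

Sorting costs extra time on large lists, so it only makes sense when the order is in doubt.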
