Question

I'm trying to find the number of words in a given string. Below is a sequential algorithm for it, which works fine.

public int getWordcount() {

        boolean lastSpace = true;
        int result = 0;

        for(char c : str.toCharArray()){
            if(Character.isWhitespace(c)){
                lastSpace = true;
            }else{
                if(lastSpace){
                    lastSpace = false;
                    ++result;
                }
            }
        }

        return result;

    }

But when I tried to 'parallelize' this with the Stream.collect(supplier, accumulator, combiner) method, I got wordCount = 0. I am using an immutable class (WordCountState) just to maintain the state of the word count.

Code:

public class WordCounter {
    private final String str = "Java8 parallelism  helps    if you know how to use it properly.";

public int getWordCountInParallel() {
        Stream<Character> charStream = IntStream.range(0, str.length())
                                                .mapToObj(i -> str.charAt(i));

        WordCountState finalState = charStream.parallel()                                             
                                              .collect(WordCountState::new,
                                                        WordCountState::accumulate,
                                                        WordCountState::combine);

        return finalState.getCounter();
    }
}

public class WordCountState {
    private final boolean lastSpace;
    private final int counter;
    private static int numberOfInstances = 0;

public WordCountState(){
        this.lastSpace = true;
        this.counter = 0;
        //numberOfInstances++;
    }

    public WordCountState(boolean lastSpace, int counter){
        this.lastSpace = lastSpace;
        this.counter = counter;
        //numberOfInstances++;
    }

//accumulator
    public WordCountState accumulate(Character c) {


        if(Character.isWhitespace(c)){
            return lastSpace ? this : new WordCountState(true, counter);
        }else{
            return lastSpace ? new WordCountState(false, counter + 1) : this;
        }   
    }

    //combiner
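    public WordCountState combine(WordCountState wordCountState) {  
        //System.out.println("Returning new obj with count : " + (counter + wordCountState.getCounter()));
        return new WordCountState(this.isLastSpace(), 
                                    (counter + wordCountState.getCounter()));
    }

    // Getters used above; they were not shown in the original post.
    public boolean isLastSpace() {
        return lastSpace;
    }

    public int getCounter() {
        return counter;
    }
}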

I've observed three issues with the above code:
1. The number of WordCountState objects created is greater than the number of characters in the string.
2. The result is always 0.
3. According to the accumulator/consumer documentation, shouldn't the accumulator return void? Even though my accumulator method returns an object, the compiler doesn't complain.

Any clue where I might have gone off track?

UPDATE: I ended up using the solution below:

public int getWordCountInParallel() {
        Stream<Character> charStream = IntStream.range(0, str.length())
                                                .mapToObj(i -> str.charAt(i));


        WordCountState finalState = charStream.parallel()
                                              .reduce(new WordCountState(),
                                                        WordCountState::accumulate,
                                                        WordCountState::combine);

        return finalState.getCounter();
    }

Answer 1

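You can always invoke a method and ignore its return value, so it is logical to allow the same when using method references. Therefore, creating a method reference to a non-void method when a consumer is required is no problem, as long as the parameters match.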
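A minimal sketch of that rule, using only JDK types: List.add returns boolean, yet a method reference to it is accepted where a Consumer is expected, and the returned value is simply dropped. The class name DiscardedReturnDemo is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical demo class; only JDK types are used.
final class DiscardedReturnDemo {

    // Feeds two strings through a Consumer built from the non-void List.add.
    static List<String> fill() {
        List<String> target = new ArrayList<>();
        // List.add(E) returns boolean, but Consumer.accept returns void:
        // the method reference compiles and the boolean is discarded.
        Consumer<String> adder = target::add;
        adder.accept("Java8");
        adder.accept("parallelism");
        return target;
    }

    public static void main(String[] args) {
        System.out.println(fill());
    }
}
```

The same mechanism is what lets WordCountState::accumulate compile as the accumulator of collect, even though its return value never reaches the stream.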

What you have created with your immutable WordCountState class is a reduction operation, i.e. it would support a use case like

Stream<Character> charStream = IntStream.range(0, str.length())
                                        .mapToObj(i -> str.charAt(i));

WordCountState finalState = charStream.parallel()
        .map(ch -> new WordCountState().accumulate(ch))
        .reduce(new WordCountState(), WordCountState::combine);

whereas the collect method supports a mutable reduction, where a container instance (which may be identical to the result) gets modified.
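For comparison, here is a small sketch of a mutable reduction in which the container really is modified in place. It rebuilds a string with StringBuilder; the class and method names are illustrative only.

```java
// Hypothetical demo class showing collect(...) with a mutable container.
final class MutableReductionDemo {

    // Rebuilds a string from its code points via mutable reduction.
    static String rebuild(String input) {
        return input.codePoints().parallel()
                .collect(StringBuilder::new,              // supplier: fresh container per chunk
                         StringBuilder::appendCodePoint,  // accumulator: mutates the container
                         StringBuilder::append)           // combiner: appends the right chunk to the left one
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(rebuild("Java8 parallelism helps"));
    }
}
```

Note that appendCodePoint and append both return the StringBuilder, but collect ignores those return values. It works here only because the builder itself is changed.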

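There is still a logical error in your solution: each WordCountState instance starts out assuming that it is preceded by a space character, without knowing the actual situation, and the combiner makes no attempt to fix this.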

A way to fix and simplify this, still using reduction, would be:

public int getWordCountInParallel() {
    return str.codePoints().parallel()
        .mapToObj(WordCountState::new)
        .reduce(WordCountState::new)
        .map(WordCountState::getResult).orElse(0);
}


public class WordCountState {
    private final boolean firstSpace, lastSpace;
    private final int counter;

    public WordCountState(int character){
        firstSpace = lastSpace = Character.isWhitespace(character);
        this.counter = 0;
    }

    public WordCountState(WordCountState a, WordCountState b) {
        this.firstSpace = a.firstSpace;
        this.lastSpace = b.lastSpace;
        this.counter = a.counter + b.counter + (a.lastSpace && !b.firstSpace? 1: 0);
    }
    public int getResult() {
        return counter+(firstSpace? 0: 1);
    }
}

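If you are worried about the number of WordCountState instances, note how many Character instances this solution does not create, compared to your initial approach.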
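Since reduce requires an associative function, it may help to check the combining rule on concrete chunks. The sketch below restates the same fields under the hypothetical name Segment and merges the chunks "ab", "c a" and "bc" of the text "abc abc" in both possible groupings.

```java
// Hypothetical demo class; Segment restates the boundary-aware state for a quick check.
final class SeamAwareSegmentDemo {

    // Immutable summary of a non-empty chunk of text.
    static final class Segment {
        final boolean firstSpace, lastSpace;
        final int counter; // word starts found after the chunk's first character

        private Segment(boolean firstSpace, boolean lastSpace, int counter) {
            this.firstSpace = firstSpace;
            this.lastSpace = lastSpace;
            this.counter = counter;
        }

        static Segment of(int codePoint) {
            boolean white = Character.isWhitespace(codePoint);
            return new Segment(white, white, 0);
        }

        // A word starts at the seam only if a ends in whitespace and b does not start with it.
        static Segment merge(Segment a, Segment b) {
            int seam = (a.lastSpace && !b.firstSpace) ? 1 : 0;
            return new Segment(a.firstSpace, b.lastSpace, a.counter + b.counter + seam);
        }

        // Summarizes a non-empty chunk by merging its code points from left to right.
        static Segment ofChunk(String chunk) {
            return chunk.codePoints().mapToObj(Segment::of).reduce(Segment::merge)
                    .orElseThrow(() -> new IllegalArgumentException("empty chunk"));
        }

        int result() {
            return counter + (firstSpace ? 0 : 1);
        }
    }

    public static void main(String[] args) {
        Segment ab = Segment.ofChunk("ab"), mid = Segment.ofChunk("c a"), bc = Segment.ofChunk("bc");
        // Both groupings describe the same text "abc abc".
        System.out.println(Segment.merge(Segment.merge(ab, mid), bc).result());
        System.out.println(Segment.merge(ab, Segment.merge(mid, bc)).result());
    }
}
```

Both groupings yield the same count, which is the property a parallel reduce relies on when it splits the input at arbitrary positions.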

However, this task is indeed suitable for mutable reduction, if you rewrite your WordCountState as a mutable result container:

public int getWordCountInParallel() {
    return str.codePoints().parallel()
        .collect(WordCountState::new, WordCountState::accumulate, WordCountState::combine)
        .getResult();
}


public class WordCountState {
    private boolean firstSpace, lastSpace=true, initial=true;
    private int counter;

    public void accumulate(int character) {
        boolean white=Character.isWhitespace(character);
        if(lastSpace && !white) counter++;
        lastSpace=white;
        if(initial) {
            firstSpace=white;
            initial=false;
        }
    }
    public void combine(WordCountState b) {
        if(initial) {
            this.initial=b.initial;
            this.counter=b.counter;
            this.firstSpace=b.firstSpace;
            this.lastSpace=b.lastSpace;
        }
        else if(!b.initial) {
            this.counter += b.counter;
            if(!lastSpace && !b.firstSpace) counter--;
            this.lastSpace = b.lastSpace;
        }
    }
    public int getResult() {
        return counter;
    }
}

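Note how using int consistently to represent Unicode characters allows you to use the codePoints() stream of a CharSequence, which is not only simpler but also handles characters outside the Basic Multilingual Plane and is potentially more efficient, as it doesn't need boxing to Character instances.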
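A brief sketch of the difference between chars() and codePoints(), using U+1F600 as an arbitrary example of a character outside the Basic Multilingual Plane. The class name and sample text are chosen for illustration.

```java
// Hypothetical demo class comparing the two IntStream views of a String.
final class CodePointDemo {

    // "a", a space, and U+1F600, which needs two chars (a surrogate pair) in UTF-16.
    static final String SAMPLE = "a " + new String(Character.toChars(0x1F600));

    // Number of UTF-16 code units; the surrogate pair counts twice.
    static long charCount() {
        return SAMPLE.chars().count();
    }

    // Number of Unicode code points; the surrogate pair counts once.
    static long codePointCount() {
        return SAMPLE.codePoints().count();
    }

    public static void main(String[] args) {
        System.out.println(charCount() + " chars, " + codePointCount() + " code points");
    }
}
```

Both streams are IntStreams, so neither boxes its elements; the boxing in the question comes from mapToObj producing a Stream<Character>.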

Answer 2

When you implement stream().collect(supplier, accumulator, combiner), they (the combiner and the accumulator) return void. The problem is that this:

  collect(WordCountState::new,
          WordCountState::accumulate,
          WordCountState::combine)

actually means, in your case (shown for the accumulator only, but the same goes for the combiner):

     (wordCounter, character) -> {
              WordCountState state = wordCounter.accumulate(character);
              return;
     }

This is indeed not trivial to spot. Suppose we have two methods:

public void accumulate(Character c) {
    if (!Character.isWhitespace(c)) {
        counter++;
    }
}

public WordCountState accumulate2(Character c) {
    if (Character.isWhitespace(c)) {
        return lastSpace ? this : new WordCountState(true, counter);
    } else {
        return lastSpace ? new WordCountState(false, counter + 1) : this;
    }
}

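For both of them, the code below compiles just fine. The same leniency applies to an expression lambda such as (s, c) -> s.accumulate2(c), but not to a block lambda that explicitly returns the value.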

BiConsumer<WordCountState, Character> cons = WordCountState::accumulate;

BiConsumer<WordCountState, Character> cons2 = WordCountState::accumulate2;

You can picture it slightly differently, via a class that implements BiConsumer, for example:

 BiConsumer<WordCountState, Character> clazz = new BiConsumer<WordCountState, Character>() {
        @Override
        public void accept(WordCountState state, Character character) {
            WordCountState newState = state.accumulate2(character);
            return;
        }
    };

So your combine and accumulate methods need to change to:

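// Returns void and mutates this container, so counter can no longer be final.
public void combine(WordCountState wordCountState) {
    counter = counter + wordCountState.getCounter();
}


// Note: as written, this counts non-whitespace characters;
// counting words additionally needs the lastSpace check.
public void accumulate(Character c) {
    if (!Character.isWhitespace(c)) {
        counter++;
    }
}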
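To see why the original version always returned 0, the following sketch reproduces the question's immutable state under the hypothetical name ImmutableState and runs it through both collect and a sequential reduce.

```java
// Hypothetical demo class; ImmutableState mimics the question's immutable WordCountState.
final class IgnoredResultDemo {

    static final class ImmutableState {
        final boolean lastSpace;
        final int counter;

        ImmutableState() { this(true, 0); }

        ImmutableState(boolean lastSpace, int counter) {
            this.lastSpace = lastSpace;
            this.counter = counter;
        }

        // Returns a NEW state instead of changing this one.
        ImmutableState accumulate(Character c) {
            if (Character.isWhitespace(c)) {
                return lastSpace ? this : new ImmutableState(true, counter);
            }
            return lastSpace ? new ImmutableState(false, counter + 1) : this;
        }

        ImmutableState combine(ImmutableState other) {
            return new ImmutableState(lastSpace, counter + other.counter);
        }
    }

    // collect drops every returned state, so the supplier's object is never updated.
    static int countWithCollect(String s) {
        return s.chars().mapToObj(i -> (char) i)
                .collect(ImmutableState::new, ImmutableState::accumulate, ImmutableState::combine)
                .counter;
    }

    // A sequential reduce threads the returned states through, so the count survives.
    static int countWithSequentialReduce(String s) {
        return s.chars().mapToObj(i -> (char) i)
                .reduce(new ImmutableState(), ImmutableState::accumulate, ImmutableState::combine)
                .counter;
    }

    public static void main(String[] args) {
        System.out.println(countWithCollect("abc abc"));
        System.out.println(countWithSequentialReduce("abc abc"));
    }
}
```

The reduce variant is kept sequential on purpose: run in parallel, it would still need the chunk boundaries to be handled in the combiner.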

Answer 3

First of all, wouldn't it be easier to just use something like input.split("\\s+").length to get the word count?
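A small sketch of that shortcut, with two caveats handled: split keeps a leading empty string when the input starts with whitespace, and a blank input still yields one (empty) element. The helper name countWords is made up here, and the regex class \s is not identical to Character.isWhitespace for every Unicode character.

```java
// Hypothetical helper class for the split-based word count.
final class SplitWordCount {

    static int countWords(String input) {
        // trim() removes leading whitespace, which would otherwise produce an empty first element
        String trimmed = input.trim();
        // "".split("\\s+") yields one empty string, so blank input is handled explicitly
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Java8 parallelism  helps    if you know how to use it properly."));
    }
}
```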

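In case this is an exercise in streams and collectors, let's discuss your implementation. You already pointed out the biggest mistake yourself: your accumulator and combiner shouldn't return new instances. The signature of collect tells you that it expects BiConsumers, which don't return anything. Because you create new objects in your accumulator, you never increment the counter of the WordCountState objects that the collector actually uses. And by creating a new object in the combiner, you discard any progress you might have made. This is also why you create more objects than there are characters in your input: one per character, and then some for the return values.

See this adapted implementation: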

public static class WordCountState
{
    private boolean lastSpace = true;
    private int     counter   = 0;

    public void accumulate(Character character)
    {
        if (!Character.isWhitespace(character))
        {
            if (lastSpace)
            {
                counter++;
            }
            lastSpace = false;
        }
        else
        {
            lastSpace = true;
        }
    }

    public void combine(WordCountState wordCountState)
    {
        counter += wordCountState.counter;
    }
}

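Here, we don't create new objects in every step but instead change the state of the ones we have. I think you tried to create new objects because your ternary operators forced you to return something and/or because you couldn't change the instance fields, as they are final. They don't need to be final, though, and you can easily change them.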

Running this adapted implementation sequentially now works fine, as we look at the characters one by one and end up with 11 words.

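In parallel, however, it fails. It seems to create a new WordCountState for every character, but doesn't count all of them, and ends up at 29 (at least for me). This shows a basic flaw in your algorithm: splitting on every character doesn't work. Imagine the input abc abc, which should result in 2. If you process it in parallel and don't specify how to split the input, you might end up with these chunks: ab, c a, bc, which would add up to 4.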

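The problem is that by parallelizing between characters (i.e. in the middle of words), you make your separate WordCountStates dependent on each other (because each would need to know which one comes before it and whether that one ended with a whitespace character). This defeats the parallelism and results in errors.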
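The abc abc example can be replayed without any streams. The sketch below counts each chunk the way the accumulator does (assuming whitespace came before it) and then adds the counts the way the combiner does. The class and method names are illustrative.

```java
import java.util.Arrays;

// Hypothetical demo class replaying the chunk split "ab", "c a", "bc".
final class ChunkSplitDemo {

    // Sequential count that, like the accumulator, assumes whitespace precedes the chunk.
    static int countAssumingLeadingSpace(String chunk) {
        boolean lastSpace = true;
        int words = 0;
        for (char c : chunk.toCharArray()) {
            boolean white = Character.isWhitespace(c);
            if (lastSpace && !white) {
                words++;
            }
            lastSpace = white;
        }
        return words;
    }

    // Sums per-chunk counts the way a combiner that only adds counters would.
    static int naiveCombined(String... chunks) {
        return Arrays.stream(chunks)
                .mapToInt(ChunkSplitDemo::countAssumingLeadingSpace)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(countAssumingLeadingSpace("abc abc"));
        System.out.println(naiveCombined("ab", "c a", "bc"));
    }
}
```

The whole string gives 2, while the three chunks give 1 + 2 + 1 = 4, because "ab"/"c" and "a"/"bc" are each counted as two words.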

Aside from all that, it might be easier to implement the Collector interface rather than providing the three methods:

public static class WordCountCollector
    implements Collector<Character, SimpleEntry<AtomicInteger, Boolean>, Integer>
{
    @Override
    public Supplier<SimpleEntry<AtomicInteger, Boolean>> supplier()
    {
        return () -> new SimpleEntry<>(new AtomicInteger(0), true);
    }

    @Override
    public BiConsumer<SimpleEntry<AtomicInteger, Boolean>, Character> accumulator()
    {
        return (count, character) -> {
            if (!Character.isWhitespace(character))
            {
                if (count.getValue())
                {
                    String before = count.getKey().get() + " -> ";
                    count.getKey().incrementAndGet();
                    System.out.println(before + count.getKey().get());
                }
                count.setValue(false);
            }
            else
            {
                count.setValue(true);
            }
        };
    }

    @Override
    public BinaryOperator<SimpleEntry<AtomicInteger, Boolean>> combiner()
    {
        return (c1, c2) -> new SimpleEntry<>(new AtomicInteger(c1.getKey().get() + c2.getKey().get()), false);
    }

    @Override
    public Function<SimpleEntry<AtomicInteger, Boolean>, Integer> finisher()
    {
        return count -> count.getKey().get();
    }

    @Override
    public Set<java.util.stream.Collector.Characteristics> characteristics()
    {
        return new HashSet<>(Arrays.asList(Characteristics.CONCURRENT, Characteristics.UNORDERED));
    }
}

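We use a pair (SimpleEntry) to keep the count and the knowledge about the last space. This way, we don't need to implement the state in the collector itself or write a parameter object for it. You can use this collector like this: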

return charStream.parallel().collect(new WordCountCollector());

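This collector parallelizes better than the initial implementation, but the results still vary (mostly between 14 and 16) because of the shortcomings of your approach mentioned above.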
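For completeness, here is a sketch of a collector that does give stable results in parallel. It is not the collector above: it borrows the boundary handling from the mutable container in Answer 1, is built with Collector.of, and declares neither CONCURRENT nor UNORDERED, since chunk order matters. All names (SeamAwareCollectorDemo, State, wordCount) are hypothetical.

```java
import java.util.stream.Collector;

// Hypothetical demo class; a boundary-aware alternative to the collector above.
final class SeamAwareCollectorDemo {

    // Mutable per-chunk container that remembers how the chunk starts and ends.
    static final class State {
        boolean empty = true, firstSpace, lastSpace = true;
        int counter;

        void accept(Character c) {
            boolean white = Character.isWhitespace(c);
            if (lastSpace && !white) {
                counter++;
            }
            lastSpace = white;
            if (empty) {
                firstSpace = white;
                empty = false;
            }
        }

        State merge(State right) {
            if (right.empty) return this;
            if (empty) return right;
            counter += right.counter;
            // the right chunk counted a word start that merely continues our last word
            if (!lastSpace && !right.firstSpace) {
                counter--;
            }
            lastSpace = right.lastSpace;
            return this;
        }
    }

    // No characteristics are passed, so the stream keeps chunks in encounter order.
    static Collector<Character, State, Integer> wordCount() {
        return Collector.of(State::new, State::accept, State::merge, s -> s.counter);
    }

    static int countWords(String s) {
        return s.chars().mapToObj(i -> (char) i).parallel().collect(wordCount());
    }

    public static void main(String[] args) {
        System.out.println(countWords("Java8 parallelism  helps    if you know how to use it properly."));
    }
}
```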

using java streams in parallel with collect(supplier, accumulator, combiner) not giving expected results
