Using Java streams in parallel with collect(supplier, accumulator, combiner) not giving expected results

Time: 2017-05-29 01:53:59

Tags: java parallel-processing java-8 java-stream

I'm trying to find the number of words in a given string. Below is a sequential algorithm for it, which works fine.

public int getWordcount() {

        boolean lastSpace = true;
        int result = 0;

        for(char c : str.toCharArray()){
            if(Character.isWhitespace(c)){
                lastSpace = true;
            }else{
                if(lastSpace){
                    lastSpace = false;
                    ++result;
                }
            }
        }

        return result;

    }

But when I tried to 'parallelize' this with the Stream.collect(supplier, accumulator, combiner) method, I get wordCount = 0. I am using an immutable class (WordCountState) just to maintain the state of the word count.

Code:

import java.util.stream.IntStream;
import java.util.stream.Stream;

public class WordCounter {
    private final String str = "Java8 parallelism  helps    if you know how to use it properly.";

    public int getWordCountInParallel() {
        Stream<Character> charStream = IntStream.range(0, str.length())
                                                .mapToObj(i -> str.charAt(i));

        WordCountState finalState = charStream.parallel()
                                              .collect(WordCountState::new,
                                                       WordCountState::accumulate,
                                                       WordCountState::combine);

        return finalState.getCounter();
    }
}

public class WordCountState {
    private final boolean lastSpace;
    private final int counter;
    private static int numberOfInstances = 0;

    public WordCountState() {
        this.lastSpace = true;
        this.counter = 0;
        //numberOfInstances++;
    }

    public WordCountState(boolean lastSpace, int counter) {
        this.lastSpace = lastSpace;
        this.counter = counter;
        //numberOfInstances++;
    }

    //accumulator
    public WordCountState accumulate(Character c) {
        if (Character.isWhitespace(c)) {
            return lastSpace ? this : new WordCountState(true, counter);
        } else {
            return lastSpace ? new WordCountState(false, counter + 1) : this;
        }
    }

    //combiner
    public WordCountState combine(WordCountState wordCountState) {
        //System.out.println("Returning new obj with count : " + (counter + wordCountState.getCounter()));
        return new WordCountState(this.isLastSpace(),
                                  (counter + wordCountState.getCounter()));
    }

    // getters referenced above (omitted from the original listing)
    public boolean isLastSpace() {
        return lastSpace;
    }

    public int getCounter() {
        return counter;
    }
}

I've observed three issues with the above code:

1. The number of WordCountState objects created is greater than the number of characters in the string.
2. The result is always 0.
3. As per the accumulator/consumer documentation, shouldn't the accumulator return void? Even though my accumulator method returns an object, the compiler doesn't complain.

Any clue where I might have gone off track?

UPDATE: I used the solution below.

public int getWordCountInParallel() {
        Stream<Character> charStream = IntStream.range(0, str.length())
                                                .mapToObj(i -> str.charAt(i));


        WordCountState finalState = charStream.parallel()
                                              .reduce(new WordCountState(),
                                                        WordCountState::accumulate,
                                                        WordCountState::combine);

        return finalState.getCounter();
    }

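3 Answers:

Answer 0 (score: 4)

You can always invoke a method and ignore its return value, so it is logical to allow the same when using method references. Therefore, creating a method reference to a non-void method where a consumer is required is no problem, as long as the parameters match.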
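As a small illustration of that rule, here is a minimal sketch using only JDK types (the class name IgnoredReturnDemo is made up for this example). List.add returns a boolean, yet a reference to it can be used where a Consumer is expected; the returned value is simply dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class IgnoredReturnDemo {
    static List<String> fill() {
        List<String> names = new ArrayList<>();
        // List.add returns boolean, but the reference still fits Consumer<String>;
        // the boolean result is ignored on every call.
        Consumer<String> adder = names::add;
        adder.accept("a");
        adder.accept("b");
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fill());
    }
}
```

This works here only because add mutates the list in place. A method that returns a new object without changing its receiver has its work discarded in the same position.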

What you have created with your immutable WordCountState class is a reduction operation, i.e. it would support a use case like:
Stream<Character> charStream = IntStream.range(0, str.length())
                                        .mapToObj(i -> str.charAt(i));

WordCountState finalState = charStream.parallel()
        .map(ch -> new WordCountState().accumulate(ch))
        .reduce(new WordCountState(), WordCountState::combine);

The collect method, in contrast, supports mutable reduction, where a container instance (which may be identical to the result) gets modified.
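For comparison, this is roughly what a mutable reduction looks like with a JDK container. It is a minimal sketch, with MutableReductionDemo as a made-up class name; the three-argument collect call mirrors the StringBuilder example from the Stream.collect documentation. Each chunk gets its own StringBuilder, which is modified in place and then appended to its left neighbour.

```java
import java.util.stream.Stream;

class MutableReductionDemo {
    static String concat() {
        return Stream.of("a", "b", "c").parallel()
                .collect(StringBuilder::new,     // supplier: a fresh container per chunk
                         StringBuilder::append,  // accumulator: mutates the container
                         StringBuilder::append)  // combiner: appends the right chunk to the left one
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(concat());
    }
}
```

StringBuilder.append also returns a value, but the result is correct because the container itself changes; the returned reference is not needed.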

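There is still a logical error in your solution: each WordCountState instance starts out assuming that it is preceded by a space character, without knowing the actual situation, and the combiner makes no attempt to correct this.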
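To make the flaw concrete, here is a minimal sketch. NaiveState is a hypothetical stand-in for the question's immutable class, reduced per character in the same way as the snippet above. Because every fresh state believes a space came before it, each non-whitespace character counts as the start of a word.

```java
class NaiveState {
    final boolean lastSpace;
    final int counter;

    NaiveState() { this(true, 0); } // always assumes a preceding space

    NaiveState(boolean lastSpace, int counter) {
        this.lastSpace = lastSpace;
        this.counter = counter;
    }

    NaiveState accumulate(char c) {
        if (Character.isWhitespace(c)) {
            return lastSpace ? this : new NaiveState(true, counter);
        }
        return lastSpace ? new NaiveState(false, counter + 1) : this;
    }

    // Adds the counters without looking at what happens at the boundary.
    NaiveState combine(NaiveState other) {
        return new NaiveState(lastSpace, counter + other.counter);
    }

    static int count(String s) {
        return s.chars().parallel()
                .mapToObj(ch -> new NaiveState().accumulate((char) ch))
                .reduce(new NaiveState(), NaiveState::combine)
                .counter;
    }

    public static void main(String[] args) {
        // "ab cd" has 2 words, but this counts the 4 non-whitespace characters.
        System.out.println(count("ab cd"));
    }
}
```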

A way to fix and simplify this, still using reduction, would be:

public int getWordCountInParallel() {
    return str.codePoints().parallel()
        .mapToObj(WordCountState::new)
        .reduce(WordCountState::new)
        .map(WordCountState::getResult).orElse(0);
}


public class WordCountState {
    private final boolean firstSpace, lastSpace;
    private final int counter;

    public WordCountState(int character){
        firstSpace = lastSpace = Character.isWhitespace(character);
        this.counter = 0;
    }

    public WordCountState(WordCountState a, WordCountState b) {
        this.firstSpace = a.firstSpace;
        this.lastSpace = b.lastSpace;
        this.counter = a.counter + b.counter + (a.lastSpace && !b.firstSpace? 1: 0);
    }
    public int getResult() {
        return counter+(firstSpace? 0: 1);
    }
}

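If you are worried about the number of WordCountState instances, note how many Character instances this solution avoids creating compared to your initial approach.

However, this task is indeed suitable for mutable reduction if you rewrite WordCountState as a mutable result container: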

public int getWordCountInParallel() {
    return str.codePoints().parallel()
        .collect(WordCountState::new, WordCountState::accumulate, WordCountState::combine)
        .getResult();
}


public class WordCountState {
    private boolean firstSpace, lastSpace=true, initial=true;
    private int counter;

    public void accumulate(int character) {
        boolean white=Character.isWhitespace(character);
        if(lastSpace && !white) counter++;
        lastSpace=white;
        if(initial) {
            firstSpace=white;
            initial=false;
        }
    }
    public void combine(WordCountState b) {
        if(initial) {
            this.initial=b.initial;
            this.counter=b.counter;
            this.firstSpace=b.firstSpace;
            this.lastSpace=b.lastSpace;
        }
        else if(!b.initial) {
            this.counter += b.counter;
            if(!lastSpace && !b.firstSpace) counter--;
            this.lastSpace = b.lastSpace;
        }
    }
    public int getResult() {
        return counter;
    }
}

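Note how consistently using int to represent Unicode characters allows using the codePoints() stream of a CharSequence, which is not only simpler but also handles characters outside the Basic Multilingual Plane and is potentially more efficient, as it does not need boxing to Character instances.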
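The difference between char values and code points shows up as soon as the text contains a supplementary character. The following is a small sketch with a made-up class name and sample text; U+1D4B3 is written as its surrogate pair so the source stays ASCII.

```java
class CodePointDemo {
    // U+1D4B3 lies outside the Basic Multilingual Plane: two chars, one code point.
    static final String TEXT = "\uD835\uDCB3 y";

    static long charCount() {
        return TEXT.chars().count();      // counts UTF-16 code units
    }

    static long codePointCount() {
        return TEXT.codePoints().count(); // counts Unicode code points
    }

    public static void main(String[] args) {
        System.out.println(charCount() + " chars, " + codePointCount() + " code points");
    }
}
```

Character.isWhitespace has an overload taking an int code point, which is why the WordCountState methods above can work on the int values directly.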

Answer 1 (score: 1)

When you implement stream().collect(supplier, accumulator, combiner), both the accumulator and the combiner return void. The problem is that this:

  collect(WordCountState::new,
          WordCountState::accumulate,
          WordCountState::combine)

in your case actually means (shown for the accumulator only, but the same goes for the combiner):

     (wordCounter, character) -> {
              // the new state returned here is never used
              WordCountState state = wordCounter.accumulate(character);
              return;
     }
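This is why the result is always 0. The effect can be reproduced with a reduced sketch; State and increment are hypothetical names, not the question's class. The container created by the supplier is never modified, so its counter keeps its initial value.

```java
import java.util.stream.Stream;

class DiscardDemo {
    static final class State {
        final int counter;

        State(int counter) { this.counter = counter; }

        // Returns a new object instead of mutating this one, as in the question.
        State increment(Character c) {
            return new State(counter + 1);
        }
    }

    static int viaCollect() {
        return Stream.of('x', 'y', 'z')
                .collect(() -> new State(0),  // the only container collect ever sees
                         State::increment,    // returned State objects are discarded
                         (a, b) -> { })       // nothing to merge in this sketch
                .counter;
    }

    public static void main(String[] args) {
        System.out.println(viaCollect());
    }
}
```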

Admittedly, this is not obvious. Suppose we have two methods:

public void accumulate(Character c) {
    if (!Character.isWhitespace(c)) {
        counter++;
    }
}

public WordCountState accumulate2(Character c) {
    if (Character.isWhitespace(c)) {
        return lastSpace ? this : new WordCountState(true, counter);
    } else {
        return lastSpace ? new WordCountState(false, counter + 1) : this;
    }
}

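For both of them the code below compiles, because a method reference to a non-void method is accepted where a BiConsumer is expected. The same holds for an expression lambda such as (s, c) -> s.accumulate2(c); only a block lambda that explicitly returns the value, (s, c) -> { return s.accumulate2(c); }, is rejected.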

BiConsumer<WordCountState, Character> cons = WordCountState::accumulate;

BiConsumer<WordCountState, Character> cons2 = WordCountState::accumulate2;

You can picture it slightly differently, via a class that implements BiConsumer, for example:

 BiConsumer<WordCountState, Character> clazz = new BiConsumer<WordCountState, Character>() {
        @Override
        public void accept(WordCountState state, Character character) {
            WordCountState newState = state.accumulate2(character);
            return;
        }
    };

As such, your combine and accumulate methods need to change to:

public void combine(WordCountState wordCountState) {
    counter = counter + wordCountState.getCounter();
}


public void accumulate(Character c) {
    if (!Character.isWhitespace(c)) {
        counter++;
    }
}

Answer 2 (score: 0)

First of all, wouldn't it be easier to just use something like input.split("\\s+").length to count the words?
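A minimal sketch of that idea follows, with SplitCount and countWords as made-up names. The one caveat handled here is that split keeps a leading empty token when the input starts with whitespace, and returns a single empty token for an empty string, so the input is trimmed and checked first.

```java
class SplitCount {
    static int countWords(String input) {
        String trimmed = input.trim();
        // "".split("\\s+") yields one empty token, so blank input is handled separately.
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String str = "Java8 parallelism  helps    if you know how to use it properly.";
        System.out.println(countWords(str));
    }
}
```

Note that trim() only strips characters up to U+0020, which covers everything \s matches by default but is narrower than Character.isWhitespace.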

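If this is an exercise in streams and collectors, let's discuss your implementation. You already pointed out the biggest mistake yourself: your accumulator and combiner should not return new instances. The signature of collect tells you that it expects BiConsumers, which do not return anything. Because you create a new object in the accumulator, you never increase the counter of the WordCountState objects your collector actually uses. And by creating a new object in the combiner you discard any progress you might have made. This is also why you create more objects than there are characters in your input: one per character, plus some for the return values.

See this adapted implementation: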

public static class WordCountState
{
    private boolean lastSpace = true;
    private int     counter   = 0;

    public void accumulate(Character character)
    {
        if (!Character.isWhitespace(character))
        {
            if (lastSpace)
            {
                counter++;
            }
            lastSpace = false;
        }
        else
        {
            lastSpace = true;
        }
    }

    public void combine(WordCountState wordCountState)
    {
        counter += wordCountState.counter;
    }
}

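Here we do not create new objects in every step, but change the state of the ones we have. I think you tried to create new objects because your conditional (ternary) operator forced you to return something and/or because you couldn't change the instance fields, as they are final. They do not need to be final, though, and you can easily change them.

Running this adapted implementation sequentially now works fine, as we look at the chars one by one and end up with 11 words.

In parallel, though, it fails. It seems to create a new WordCountState for every char, but does not count all of them, and ends up at 29 (at least for me). This shows a basic flaw in your algorithm: splitting on every character doesn't work. Imagine the input abc abc, which should result in 2. If you process it in parallel and do not specify how to split the input, you might end up with these chunks: ab, c a, bc, which would add up to 4.

The problem is that by parallelizing between characters (i.e. in the middle of words), you make your separate WordCountStates dependent on each other, because each would need to know which one comes before it and whether that one ended with a whitespace character. This defeats the parallelism and results in errors.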
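The arithmetic of that example can be checked with a short sketch. ChunkDemo is a made-up class; countWords is the sequential loop from the question applied to an arbitrary string, and the chunk boundaries are simply the ones assumed above, not what a real spliterator would necessarily produce.

```java
import java.util.Arrays;

class ChunkDemo {
    // The sequential algorithm from the question, applied to one string.
    static int countWords(String s) {
        boolean lastSpace = true;
        int result = 0;
        for (char c : s.toCharArray()) {
            if (Character.isWhitespace(c)) {
                lastSpace = true;
            } else if (lastSpace) {
                lastSpace = false;
                result++;
            }
        }
        return result;
    }

    // Counts every chunk on its own and adds the counts, ignoring the boundaries.
    static int sumOfChunks(String... chunks) {
        return Arrays.stream(chunks).mapToInt(ChunkDemo::countWords).sum();
    }

    public static void main(String[] args) {
        System.out.println(countWords("abc abc"));            // whole input
        System.out.println(sumOfChunks("ab", "c a", "bc"));   // same input, cut mid-word
    }
}
```

The whole input gives 2, while the three chunks give 1 + 2 + 1 = 4, because each chunk treats its first letter as the start of a new word.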

Aside from all that, it might be easier to implement the Collector interface instead of providing the three methods:

public static class WordCountCollector
    implements Collector<Character, SimpleEntry<AtomicInteger, Boolean>, Integer>
{
    @Override
    public Supplier<SimpleEntry<AtomicInteger, Boolean>> supplier()
    {
        return () -> new SimpleEntry<>(new AtomicInteger(0), true);
    }

    @Override
    public BiConsumer<SimpleEntry<AtomicInteger, Boolean>, Character> accumulator()
    {
        return (count, character) -> {
            if (!Character.isWhitespace(character))
            {
                if (count.getValue())
                {
                    String before = count.getKey().get() + " -> ";
                    count.getKey().incrementAndGet();
                    System.out.println(before + count.getKey().get());
                }
                count.setValue(false);
            }
            else
            {
                count.setValue(true);
            }
        };
    }

    @Override
    public BinaryOperator<SimpleEntry<AtomicInteger, Boolean>> combiner()
    {
        return (c1, c2) -> new SimpleEntry<>(new AtomicInteger(c1.getKey().get() + c2.getKey().get()), false);
    }

    @Override
    public Function<SimpleEntry<AtomicInteger, Boolean>, Integer> finisher()
    {
        return count -> count.getKey().get();
    }

    @Override
    public Set<java.util.stream.Collector.Characteristics> characteristics()
    {
        return new HashSet<>(Arrays.asList(Characteristics.CONCURRENT, Characteristics.UNORDERED));
    }
}

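We use a pair (SimpleEntry) to keep the count and the knowledge about the last space. This way, we do not need to implement the state in the collector itself or write a parameter object for it. You can use this collector like this: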

return charStream.parallel().collect(new WordCountCollector());

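This collector parallelizes much better than the initial implementation, but the results still vary (mostly between 14 and 16) because of the weaknesses in your approach mentioned above.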
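One possible way around those weaknesses, sketched here under assumptions rather than as a definitive implementation, is to let each chunk remember whether it starts and ends with whitespace and to correct the count when two chunks are merged, which is the idea from the first answer. The names BoundaryAwareWordCount, Chunk and wordCount are hypothetical. Collector.of is called without characteristics, so the collector is neither CONCURRENT nor UNORDERED and chunks are merged left to right.

```java
import java.util.stream.Collector;

class BoundaryAwareWordCount {
    // Hypothetical mutable container for one chunk of the input.
    static final class Chunk {
        boolean empty = true, firstSpace, lastSpace = true;
        int words;

        void accept(int codePoint) {
            boolean white = Character.isWhitespace(codePoint);
            if (lastSpace && !white) words++;
            lastSpace = white;
            if (empty) {
                firstSpace = white;
                empty = false;
            }
        }

        Chunk merge(Chunk right) {
            if (right.empty) return this;
            if (empty) return right;
            words += right.words;
            // A word spanning the boundary was counted once on each side.
            if (!lastSpace && !right.firstSpace) words--;
            lastSpace = right.lastSpace;
            return this;
        }
    }

    static Collector<Integer, Chunk, Integer> wordCount() {
        return Collector.of(Chunk::new, Chunk::accept, Chunk::merge, c -> c.words);
    }

    static int count(String s) {
        // boxed() is needed because a Collector works on object streams.
        return s.codePoints().parallel().boxed().collect(wordCount());
    }

    public static void main(String[] args) {
        System.out.println(count("Java8 parallelism  helps    if you know how to use it properly."));
    }
}
```

The boxing to Integer is a cost of going through the Collector interface; the three-argument collect on the IntStream, as in the first answer, avoids it.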