I'm trying to count the number of words in a given string. Below is a sequential algorithm for it, which works fine.
public int getWordcount() {
    boolean lastSpace = true;
    int result = 0;
    for (char c : str.toCharArray()) {
        if (Character.isWhitespace(c)) {
            lastSpace = true;
        } else {
            if (lastSpace) {
                lastSpace = false;
                ++result;
            }
        }
    }
    return result;
}
But when I tried to 'parallelize' this with the Stream.collect(supplier, accumulator, combiner) method, I got wordCount = 0. I am using an immutable class (WordCountState) just to maintain the state of the word count.
Code:
import java.util.stream.IntStream;
import java.util.stream.Stream;
public class WordCounter {
    private final String str = "Java8 parallelism helps if you know how to use it properly.";
    public int getWordCountInParallel() {
        Stream<Character> charStream = IntStream.range(0, str.length())
                .mapToObj(i -> str.charAt(i));
        WordCountState finalState = charStream.parallel()
                .collect(WordCountState::new,
                        WordCountState::accumulate,
                        WordCountState::combine);
        return finalState.getCounter();
    }
}
public class WordCountState {
    private final boolean lastSpace;
    private final int counter;
    private static int numberOfInstances = 0;
    public WordCountState() {
        this.lastSpace = true;
        this.counter = 0;
        //numberOfInstances++;
    }
    public WordCountState(boolean lastSpace, int counter) {
        this.lastSpace = lastSpace;
        this.counter = counter;
        //numberOfInstances++;
    }
    //accumulator
    public WordCountState accumulate(Character c) {
        if (Character.isWhitespace(c)) {
            return lastSpace ? this : new WordCountState(true, counter);
        } else {
            return lastSpace ? new WordCountState(false, counter + 1) : this;
        }
    }
    //combiner
    public WordCountState combine(WordCountState wordCountState) {
        //System.out.println("Returning new obj with count : " + (counter + wordCountState.getCounter()));
        return new WordCountState(this.isLastSpace(),
                (counter + wordCountState.getCounter()));
    }
    // accessors used by combine and by WordCounter
    public boolean isLastSpace() {
        return lastSpace;
    }
    public int getCounter() {
        return counter;
    }
}
I've observed three issues with the above code:
1. The number of WordCountState objects created is greater than the number of characters in the string.
2. The result is always 0.
3. According to the accumulator/consumer documentation, shouldn't the accumulator return void? Even though my accumulator method returns an object, the compiler doesn't complain.
Any clue where I might have gone off track?
UPDATE: I used the solution below.
public int getWordCountInParallel() {
    Stream<Character> charStream = IntStream.range(0, str.length())
            .mapToObj(i -> str.charAt(i));
    WordCountState finalState = charStream.parallel()
            .reduce(new WordCountState(),
                    WordCountState::accumulate,
                    WordCountState::combine);
    return finalState.getCounter();
}
Answer 0 (score: 4)
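You can always invoke a method and ignore its return value, so it is logical to allow the same when using method references. Creating a method reference to a non-void method where a consumer is required is therefore no problem, as long as the parameters match.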
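As a minimal sketch of that rule using only JDK types: List.add returns boolean, yet a method reference to it is accepted where a Consumer is expected, and the return value is simply dropped. The class and method names (VoidCompatDemo, collectWords) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class VoidCompatDemo {
    // Feeds two values through a Consumer built from a non-void method.
    static List<String> collectWords() {
        List<String> target = new ArrayList<>();
        // List.add(E) returns boolean; the method reference is still a valid Consumer<String>.
        Consumer<String> adder = target::add;
        adder.accept("Java8");
        adder.accept("parallelism");
        return target;
    }

    public static void main(String[] args) {
        System.out.println(collectWords()); // [Java8, parallelism]
    }
}
```

The same rule explains the result of 0 in the question: the WordCountState returned by accumulate is discarded, so the container created by the supplier is never modified.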
What you have created with your immutable WordCountState class is a reduction operation, i.e. it would support a use case like:
Stream<Character> charStream = IntStream.range(0, str.length())
        .mapToObj(i -> str.charAt(i));
WordCountState finalState = charStream.parallel()
        .map(ch -> new WordCountState().accumulate(ch))
        .reduce(new WordCountState(), WordCountState::combine);
whereas the collect method supports a mutable reduction, in which a container instance (possibly identical to the result) is modified.
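There is still a logical error in your solution: each WordCountState instance starts out assuming that it is preceded by a whitespace character, without knowing the actual situation, and the combiner makes no attempt to correct this.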
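To make that error concrete, here is a small sketch that simulates two parallel tasks by hand instead of relying on how the stream happens to split. State is an illustrative copy of the immutable class from the question; NaiveCombineDemo and fold are hypothetical names.

```java
class NaiveCombineDemo {
    // Illustrative copy of the immutable state from the question.
    static final class State {
        final boolean lastSpace;
        final int counter;

        State(boolean lastSpace, int counter) {
            this.lastSpace = lastSpace;
            this.counter = counter;
        }

        State accumulate(char c) {
            if (Character.isWhitespace(c)) {
                return lastSpace ? this : new State(true, counter);
            }
            return lastSpace ? new State(false, counter + 1) : this;
        }

        // Only adds the counters; ignores what happens at the chunk boundary.
        State combine(State other) {
            return new State(lastSpace, counter + other.counter);
        }
    }

    // Folds one chunk starting from the identity, as each parallel task would.
    static State fold(String chunk) {
        State s = new State(true, 0);
        for (char c : chunk.toCharArray()) {
            s = s.accumulate(c);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(fold("Java8").counter);                   // 1
        System.out.println(fold("Ja").combine(fold("va8")).counter); // 2: the word is counted twice
    }
}
```

Because the second chunk assumes a preceding space, the single word is counted once per chunk.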
A way to fix and simplify this, still using reduction, would be:
public int getWordCountInParallel() {
    return str.codePoints().parallel()
            .mapToObj(WordCountState::new)
            .reduce(WordCountState::new)
            .map(WordCountState::getResult).orElse(0);
}
public class WordCountState {
    private final boolean firstSpace, lastSpace;
    private final int counter;
    public WordCountState(int character) {
        firstSpace = lastSpace = Character.isWhitespace(character);
        this.counter = 0;
    }
    public WordCountState(WordCountState a, WordCountState b) {
        this.firstSpace = a.firstSpace;
        this.lastSpace = b.lastSpace;
        this.counter = a.counter + b.counter + (a.lastSpace && !b.firstSpace ? 1 : 0);
    }
    public int getResult() {
        return counter + (firstSpace ? 0 : 1);
    }
}
If you are worried about the number of WordCountState instances, note how many Character instances this solution avoids creating compared to your initial approach.
However, this task is indeed suitable for mutable reduction if you rewrite WordCountState as a mutable result container:
public int getWordCountInParallel() {
    return str.codePoints().parallel()
            .collect(WordCountState::new, WordCountState::accumulate, WordCountState::combine)
            .getResult();
}
public class WordCountState {
    private boolean firstSpace, lastSpace = true, initial = true;
    private int counter;
    public void accumulate(int character) {
        boolean white = Character.isWhitespace(character);
        if (lastSpace && !white) counter++;
        lastSpace = white;
        if (initial) {
            firstSpace = white;
            initial = false;
        }
    }
    public void combine(WordCountState b) {
        if (initial) {
            this.initial = b.initial;
            this.counter = b.counter;
            this.firstSpace = b.firstSpace;
            this.lastSpace = b.lastSpace;
        }
        else if (!b.initial) {
            this.counter += b.counter;
            if (!lastSpace && !b.firstSpace) counter--;
            this.lastSpace = b.lastSpace;
        }
    }
    public int getResult() {
        return counter;
    }
}
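One way to convince yourself that this combiner handles chunk boundaries is to try every two-way split of a string by hand and compare against the sequential count. The sketch below does that; State is an illustrative copy of the mutable container logic above, and SplitCheckDemo, fold and consistentForAllSplits are hypothetical names. It only checks two-way splits, so it is a sanity check rather than a proof.

```java
class SplitCheckDemo {
    // Illustrative copy of the mutable container logic shown above.
    static final class State {
        boolean firstSpace, lastSpace = true, initial = true;
        int counter;

        void accumulate(int ch) {
            boolean white = Character.isWhitespace(ch);
            if (lastSpace && !white) counter++;
            lastSpace = white;
            if (initial) {
                firstSpace = white;
                initial = false;
            }
        }

        void combine(State b) {
            if (initial) {
                initial = b.initial;
                counter = b.counter;
                firstSpace = b.firstSpace;
                lastSpace = b.lastSpace;
            } else if (!b.initial) {
                counter += b.counter;
                // A word spanning the boundary was counted by both sides.
                if (!lastSpace && !b.firstSpace) counter--;
                lastSpace = b.lastSpace;
            }
        }
    }

    // Accumulates one chunk into a fresh container.
    static State fold(String chunk) {
        State s = new State();
        chunk.codePoints().forEach(s::accumulate);
        return s;
    }

    // True if every two-way split of text combines to the sequential count.
    static boolean consistentForAllSplits(String text) {
        int expected = fold(text).counter;
        for (int i = 0; i <= text.length(); i++) {
            State left = fold(text.substring(0, i));
            left.combine(fold(text.substring(i)));
            if (left.counter != expected) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "Java8 parallelism helps if you know how to use it properly.";
        System.out.println(fold(text).counter + " " + consistentForAllSplits(text)); // prints "11 true"
    }
}
```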
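Note how consistently using int to represent Unicode characters allows using the codePoints() stream of a CharSequence, which is not only simpler but also handles characters outside the Basic Multilingual Plane, and is potentially more efficient because it needs no boxing to Character instances.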
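A short sketch of the difference, assuming a sample string that contains one supplementary character (U+1F600, written as a surrogate pair); CodePointDemo is a hypothetical name.

```java
class CodePointDemo {
    // "a", U+1F600 (outside the Basic Multilingual Plane), "b"
    static final String TEXT = "a\uD83D\uDE00b";

    public static void main(String[] args) {
        System.out.println(TEXT.length());             // 4 UTF-16 code units
        System.out.println(TEXT.chars().count());      // 4 char values, the pair is seen as two
        System.out.println(TEXT.codePoints().count()); // 3 code points, the pair is seen as one
    }
}
```

Both chars() and codePoints() return an IntStream, so neither boxes its elements; the boxing in the question comes from mapToObj producing a Stream<Character>.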
Answer 1 (score: 1)
With stream().collect(supplier, accumulator, combiner), the accumulator and the combiner both return void. The problem is that this:
collect(WordCountState::new,
WordCountState::accumulate,
WordCountState::combine)
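in your case actually means (shown for the accumulator only, but the same goes for the combiner):
(wordCounter, character) -> {
    WordCountState state = wordCounter.accumulate(character);
    return;
}
Admittedly, this is not obvious. Let's say we have two methods: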
public void accumulate(Character c) {
    if (!Character.isWhitespace(c)) {
        counter++;
    }
}
public WordCountState accumulate2(Character c) {
    if (Character.isWhitespace(c)) {
        return lastSpace ? this : new WordCountState(true, counter);
    } else {
        return lastSpace ? new WordCountState(false, counter + 1) : this;
    }
}
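For both of them the code below compiles fine. This holds for method references (and for lambdas whose body is just the method call); a block lambda that explicitly returns the WordCountState would not compile as a BiConsumer.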
BiConsumer<WordCountState, Character> cons = WordCountState::accumulate;
BiConsumer<WordCountState, Character> cons2 = WordCountState::accumulate2;
You can picture it slightly differently, via a class that implements BiConsumer, for example:
BiConsumer<WordCountState, Character> clazz = new BiConsumer<WordCountState, Character>() {
    @Override
    public void accept(WordCountState state, Character character) {
        WordCountState newState = state.accumulate2(character);
        return;
    }
};
As such, your combine and accumulate methods need to change to:
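public void combine(WordCountState wordCountState) {
    counter = counter + wordCountState.getCounter();
}
// Simplified for illustration: this counts non-whitespace characters;
// the word-boundary (lastSpace) check is left out.
public void accumulate(Character c) {
    if (!Character.isWhitespace(c)) {
        counter++;
    }
}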
Answer 2 (score: 0)
First of all, wouldn't it be easier to just use something like input.split("\\s+").length to get the word count?
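A minimal sketch of the split-based approach, with two edge cases handled that the bare expression misses: leading whitespace yields an empty first token, and an empty string still splits into one element. SplitCountDemo and countWords are hypothetical names. Note that \s and trim() do not treat exactly the same characters as whitespace as Character.isWhitespace does, so results can differ for unusual whitespace.

```java
class SplitCountDemo {
    // Counts whitespace-separated words; trims first so that leading
    // whitespace does not produce an empty first token.
    static int countWords(String input) {
        String trimmed = input.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Java8 parallelism helps if you know how to use it properly.")); // 11
        System.out.println(countWords("  a  b "));                                                     // 2
        System.out.println(countWords("   "));                                                         // 0
    }
}
```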
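If this is an exercise in streams and collectors, let's discuss your implementation. You already pointed out the biggest mistake yourself: your accumulator and combiner should not return new instances. The signature of collect tells you that it expects BiConsumers, which do not return anything. Because you create a new object in the accumulator, you never increase the count of the WordCountState objects that the collector actually uses. And by creating a new object in the combiner, you discard any progress you might have made. This is also why you create more objects than there are characters in your input: one per character, plus some for the return values.
See this adapted implementation: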
public static class WordCountState
{
    private boolean lastSpace = true;
    private int counter = 0;
    public void accumulate(Character character)
    {
        if (!Character.isWhitespace(character))
        {
            if (lastSpace)
            {
                counter++;
            }
            lastSpace = false;
        }
        else
        {
            lastSpace = true;
        }
    }
    public void combine(WordCountState wordCountState)
    {
        counter += wordCountState.counter;
    }
}
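Here, we do not create new objects at every step but change the state of the ones we have. I think you tried to create new objects because your ternary operators forced you to return something and/or because you couldn't change the instance fields, as they are final. They do not need to be final, though, and you can easily change them.
Running this adapted implementation sequentially now works fine, since we look at the characters one by one and end up with 11 words.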
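In parallel, though, it fails. It seems to create a new WordCountState for every character but does not count all of them, and ends up at 29 (at least for me). This shows a basic flaw in the algorithm: splitting at every character does not work. Imagine the input abc abc, which should result in 2. If you process it in parallel and do not specify how to split the input, you might end up with these chunks: ab, c a, bc, which would add up to 4.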
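The arithmetic for that example can be checked with a small sketch that counts each chunk independently and adds the counters, which is what a combiner that only sums does. ChunkSumDemo, countWords and sumOfChunks are hypothetical names, and the chunk boundaries are fixed by hand rather than taken from a real parallel run.

```java
class ChunkSumDemo {
    // Sequential word count, same logic as the question's getWordcount().
    static int countWords(String text) {
        boolean lastSpace = true;
        int result = 0;
        for (char c : text.toCharArray()) {
            if (Character.isWhitespace(c)) {
                lastSpace = true;
            } else if (lastSpace) {
                lastSpace = false;
                result++;
            }
        }
        return result;
    }

    // Counts each chunk on its own and sums the counters.
    static int sumOfChunks(String... chunks) {
        int total = 0;
        for (String chunk : chunks) {
            total += countWords(chunk);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countWords("abc abc"));          // 2
        System.out.println(sumOfChunks("ab", "c a", "bc")); // 4
    }
}
```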
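The problem is that by parallelizing between characters (i.e. in the middle of words), you make the separate WordCountState instances depend on each other, because each would need to know which one comes before it and whether that one ended with a whitespace character. This defeats the parallelism and results in errors.
Aside from all that, it might be easier to implement the Collector interface instead of providing the three methods: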
public static class WordCountCollector
        implements Collector<Character, SimpleEntry<AtomicInteger, Boolean>, Integer>
{
    @Override
    public Supplier<SimpleEntry<AtomicInteger, Boolean>> supplier()
    {
        return () -> new SimpleEntry<>(new AtomicInteger(0), true);
    }
    @Override
    public BiConsumer<SimpleEntry<AtomicInteger, Boolean>, Character> accumulator()
    {
        return (count, character) -> {
            if (!Character.isWhitespace(character))
            {
                if (count.getValue())
                {
                    String before = count.getKey().get() + " -> ";
                    count.getKey().incrementAndGet();
                    System.out.println(before + count.getKey().get());
                }
                count.setValue(false);
            }
            else
            {
                count.setValue(true);
            }
        };
    }
    @Override
    public BinaryOperator<SimpleEntry<AtomicInteger, Boolean>> combiner()
    {
        return (c1, c2) -> new SimpleEntry<>(new AtomicInteger(c1.getKey().get() + c2.getKey().get()), false);
    }
    @Override
    public Function<SimpleEntry<AtomicInteger, Boolean>, Integer> finisher()
    {
        return count -> count.getKey().get();
    }
    @Override
    public Set<java.util.stream.Collector.Characteristics> characteristics()
    {
        return new HashSet<>(Arrays.asList(Characteristics.CONCURRENT, Characteristics.UNORDERED));
    }
}
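We use a pair (SimpleEntry) to hold the count and the knowledge about the last space. This way, we do not need to implement the state in the collector itself or write a parameter object for it. You can use this collector like this: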
return charStream.parallel().collect(new WordCountCollector());
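This collector parallelizes better than the initial implementation, but the results still vary (mostly between 14 and 16) because of the weaknesses in your approach mentioned above.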