How to group a large stream into sub-streams

Time: 2018-10-23 06:06:25

Tags: scala fs2

I want to group a large Stream[F, A] into a Stream[F, Stream[F, A]] in which each inner stream contains at most n elements.

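This is what I did: basically, I pipe the chunks into a Queue[F, Queue[F, Chunk[A]]] and then emit the queue elements as the resulting stream.

But this is very complicated. Is there a simpler way?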

3 answers:

Answer 0 (score: 1)

fs2 has the chunkN and chunkLimit methods, which help with grouping:

stream.chunkN(n).map(Stream.chunk)

stream.chunkLimit(n).map(Stream.chunk)

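chunkN produces chunks of size n until the stream ends (by default, the last chunk may hold fewer than n elements).

chunkLimit splits existing chunks so that none exceeds n elements; it does not merge smaller chunks, so the resulting chunk sizes can vary.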

scala> Stream(1,2,3).repeat.chunkN(2).take(5).toList
res0: List[Chunk[Int]] = List(Chunk(1, 2), Chunk(3, 1), Chunk(2, 3), Chunk(1, 2), Chunk(3, 1))

scala> (Stream(1) ++ Stream(2, 3) ++ Stream(4, 5, 6)).chunkLimit(2).toList
res0: List[Chunk[Int]] = List(Chunk(1), Chunk(2, 3), Chunk(4, 5), Chunk(6))
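To tie this back to the question, here is a minimal sketch of turning the chunkN result into a stream of inner streams. The helper name `grouped`, the object name `ChunkNDemo`, and the fs2 version in the dependency comment are my own assumptions for illustration, not part of fs2 or of the question. It uses a pure stream so it can be checked without an effect type.

```scala
//> using dep "co.fs2::fs2-core:3.9.4"
// (dependency line is an assumption: any fs2 version that has chunkN should do)
import fs2.{Pure, Stream}

object ChunkNDemo {
  // Hypothetical helper: each inner stream wraps one chunk of at most n elements.
  def grouped[F[_], A](s: Stream[F, A], n: Int): Stream[F, Stream[F, A]] =
    s.chunkN(n).map(c => Stream.chunk(c).covary[F])

  // Pure example: materialize every inner stream to inspect the grouping.
  val groups: List[List[Int]] =
    grouped[Pure, Int](Stream(1, 2, 3, 4, 5), 2).map(_.toList).toList
  // groups == List(List(1, 2), List(3, 4), List(5))
}
```

Since every inner stream is backed by a single in-memory chunk, this only fits when n elements can be held in memory at once.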

Answer 1 (score: 0)

In addition to the chunkN already mentioned, also consider groupWithin (fs2 1.0.1):

  

def groupWithin[F2[x] >: F[x]](n: Int, d: FiniteDuration)(implicit timer: Timer[F2], F: Concurrent[F2]): Stream[F2, Chunk[O]]

     

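Divide this stream into groups of elements received within a time window, or limited by the number of elements, whichever happens first. Empty groups may occur if no elements can be pulled from upstream in a given time window.

Note: a time window starts each time downstream pulls.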

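I'm not sure why you would want this nested as streams, since the requirement is to have "at most n elements" in one batch. That means you are keeping track of a finite number of elements, which is exactly what a Chunk is for. Either way, a Chunk can always be turned into a Stream with Stream.chunk: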

val chunks: Stream[F, Chunk[O]] = ???
val streamOfStreams:  Stream[F, Stream[F, O]] = chunks.map(Stream.chunk)

Here is a complete example of how to use groupWithin:

import cats.implicits._
import cats.effect.{ExitCode, IO, IOApp}
import fs2._
import scala.concurrent.duration._

object GroupingDemo extends IOApp {
  override def run(args: List[String]): IO[ExitCode] = {
    Stream('a, 'b, 'c).covary[IO]
      .groupWithin(2, 1.second)
      .map(_.toList)
      .showLinesStdOut
      .compile.drain
      .as(ExitCode.Success)
  }
}

Output:

List('a, 'b)

List('c)

Answer 2 (score: 0)

In the end I used a more robust version (it uses Hotswap to make sure the inner queues are terminated).

  def grouped(
      innerSize: Int
    )(implicit F: Async[F]): Stream[F, Stream[F, A]] = {

      type InnerQueue = Queue[F, Option[Chunk[A]]]
      type OuterQueue = Queue[F, Option[InnerQueue]]

      def swapperInner(swapper: Hotswap[F, InnerQueue], outer: OuterQueue) = {
        val innerRes =
          Resource.make(Queue.unbounded[F, Option[Chunk[A]]])(_.offer(None))
        swapper.swap(innerRes).flatTap(q => outer.offer(q.some))
      }

      def loopChunk(
        gathered: Int,
        curr: Queue[F, Option[Chunk[A]]],
        chunk: Chunk[A],
        newInnerQueue: F[InnerQueue]
      ): F[(Int, Queue[F, Option[Chunk[A]]])] = {
        if (gathered + chunk.size > innerSize) {
          val (left, right) = chunk.splitAt(innerSize - gathered)
          curr.offer(left.some) >> newInnerQueue.flatMap { nq =>
            loopChunk(0, nq, right, newInnerQueue)
          }
        } else if (gathered + chunk.size == innerSize) {
          curr.offer(chunk.some) >> newInnerQueue.tupleLeft(
            0
          )
        } else {
          curr.offer(chunk.some).as(gathered + chunk.size -> curr)
        }
      }

      val prepare = for {
        outer   <- Resource.eval(Queue.unbounded[F, Option[InnerQueue]])
        swapper <- Hotswap.create[F, InnerQueue]
      } yield outer -> swapper

      Stream.resource(prepare).flatMap {
        case (outer, swapper) =>
          val newInner = swapperInner(swapper, outer)
          val background = Stream.eval(newInner).flatMap { initQueue =>
            s.chunks
              .filter(_.nonEmpty)
              .evalMapAccumulate(0 -> initQueue) { (state, chunk) =>
                val (gathered, curr) = state
                loopChunk(gathered, curr, chunk, newInner).tupleRight({})
              }
              .onFinalize(swapper.clear *> outer.offer(None))
          }
          val foreground = Stream
            .fromQueueNoneTerminated(outer)
            .map(i => Stream.fromQueueNoneTerminatedChunk(i))
          foreground.concurrently(background)
      }

    }