I want to group a large Stream[F, A] into a Stream[Stream[F, A]] where each inner stream contains at most n elements.
Here is what I did: basically pipe the chunks into a Queue[F, Queue[F, Chunk[A]]] and then emit the queue elements as the resulting stream.
But this is very complicated. Is there a simpler way?
Answer 0 (score: 1)
fs2 has the chunkN and chunkLimit methods, which help with grouping:
stream.chunkN(n).map(Stream.chunk)
stream.chunkLimit(n).map(Stream.chunk)
chunkN produces chunks of size n until the end of the stream; chunkLimit splits the existing chunks and can produce chunks of variable size.
scala> Stream(1,2,3).repeat.chunkN(2).take(5).toList
res0: List[Chunk[Int]] = List(Chunk(1, 2), Chunk(3, 1), Chunk(2, 3), Chunk(1, 2), Chunk(3, 1))
scala> (Stream(1) ++ Stream(2, 3) ++ Stream(4, 5, 6)).chunkLimit(2).toList
res0: List[Chunk[Int]] = List(Chunk(1), Chunk(2, 3), Chunk(4, 5), Chunk(6))
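To get the exact Stream[Stream[F, A]] shape the question asks for, the resulting chunks can be mapped back into streams. A minimal sketch, assuming an IO effect and an illustrative range input:
import cats.effect.IO
import fs2.Stream

// Group an effectful stream into inner streams of at most 3 elements each.
val nested: Stream[IO, Stream[IO, Int]] =
  Stream.range(0, 10).covary[IO].chunkN(3).map(Stream.chunk)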
Answer 1 (score: 0)
In addition to the already mentioned chunkN, consider using groupWithin (fs2 1.0.1):
def groupWithin[F2[x] >: F[x]](n: Int, d: FiniteDuration)(implicit timer: Timer[F2], F: Concurrent[F2]): Stream[F2, Chunk[O]]
Divides this stream into groups of elements received within a time window, or limited by the number of elements, whichever happens first. Empty groups can occur if no elements can be pulled from upstream within a given time window.
Note: a time window starts each time downstream pulls.
I am not sure why you want this nested as streams, since the requirement is "at most n elements in a batch", which means you are keeping track of a finite number of elements (which is exactly what Chunk is for). Either way, a Chunk can always be represented as a Stream with Stream.chunk:
val chunks: Stream[F, Chunk[O]] = ???
val streamOfStreams: Stream[F, Stream[F, O]] = chunks.map(Stream.chunk)
Here is a complete example of how to use groupWithin:
import cats.implicits._
import cats.effect.{ExitCode, IO, IOApp}
import fs2._

import scala.concurrent.duration._

object GroupingDemo extends IOApp {
  override def run(args: List[String]): IO[ExitCode] = {
    Stream('a, 'b, 'c).covary[IO]
      .groupWithin(2, 1.second)
      .map(_.toList)
      .showLinesStdOut
      .compile.drain
      .as(ExitCode.Success)
  }
}
Output:
List('a, 'b)
List('c)
Answer 2 (score: 0)
In the end I used a more reliable version (using Hotswap to make sure the queues are terminated):
import cats.syntax.all._
import cats.effect.{Async, Resource}
import cats.effect.std.{Hotswap, Queue}
import fs2.{Chunk, Stream}

// Assumed wrapper so that `s`, `F`, and `A` are in scope; the original
// snippet references an enclosing stream `s`.
final class GroupedOps[F[_], A](s: Stream[F, A]) {

  def grouped(
      innerSize: Int
  )(implicit F: Async[F]): Stream[F, Stream[F, A]] = {
    type InnerQueue = Queue[F, Option[Chunk[A]]]
    type OuterQueue = Queue[F, Option[InnerQueue]]

    // Swap in a fresh inner queue: releasing the previous one offers None,
    // which terminates the inner stream built from it.
    def swapperInner(swapper: Hotswap[F, InnerQueue], outer: OuterQueue) = {
      val innerRes =
        Resource.make(Queue.unbounded[F, Option[Chunk[A]]])(_.offer(None))
      swapper.swap(innerRes).flatTap(q => outer.offer(q.some))
    }

    // Push a chunk into the current inner queue, splitting it and rolling
    // over to a new inner queue whenever `innerSize` elements are gathered.
    def loopChunk(
        gathered: Int,
        curr: Queue[F, Option[Chunk[A]]],
        chunk: Chunk[A],
        newInnerQueue: F[InnerQueue]
    ): F[(Int, Queue[F, Option[Chunk[A]]])] = {
      if (gathered + chunk.size > innerSize) {
        val (left, right) = chunk.splitAt(innerSize - gathered)
        curr.offer(left.some) >> newInnerQueue.flatMap { nq =>
          loopChunk(0, nq, right, newInnerQueue)
        }
      } else if (gathered + chunk.size == innerSize) {
        curr.offer(chunk.some) >> newInnerQueue.tupleLeft(0)
      } else {
        curr.offer(chunk.some).as(gathered + chunk.size -> curr)
      }
    }

    val prepare = for {
      outer <- Resource.eval(Queue.unbounded[F, Option[InnerQueue]])
      swapper <- Hotswap.create[F, InnerQueue]
    } yield outer -> swapper

    Stream.resource(prepare).flatMap {
      case (outer, swapper) =>
        val newInner = swapperInner(swapper, outer)

        // Drain the source in the background, filling the inner queues.
        val background = Stream.eval(newInner).flatMap { initQueue =>
          s.chunks
            .filter(_.nonEmpty)
            .evalMapAccumulate(0 -> initQueue) { (state, chunk) =>
              val (gathered, curr) = state
              loopChunk(gathered, curr, chunk, newInner).tupleRight({})
            }
            .onFinalize(swapper.clear *> outer.offer(None))
        }

        // Emit each inner queue as a None-terminated inner stream.
        val foreground = Stream
          .fromQueueNoneTerminated(outer)
          .map(i => Stream.fromQueueNoneTerminatedChunk(i))

        foreground.concurrently(background)
    }
  }
}
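A possible usage sketch (the GroupedOps wrapper name and this driver are illustrative assumptions, not part of the original answer):
import cats.effect.{IO, IOApp}
import fs2.Stream

object GroupedDemo extends IOApp.Simple {
  def run: IO[Unit] =
    new GroupedOps(Stream.range(0, 10).covary[IO])
      .grouped(3)
      // Materialize each inner stream and print it, e.g. List(0, 1, 2).
      .evalMap(_.compile.toList.flatMap(IO.println))
      .compile
      .drain
}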