Generalizing the merge function of Haskell's "streaming" library

Date: 2019-12-04 06:44:41

Tags: haskell streaming heap haskell-streaming

The goal is to generalize the Streaming.Prelude.merge function,

merge :: (Monad m, Ord a) => Stream (Of a) m r -> Stream (Of a) m s -> Stream (Of a) m (r, s) 

to an arbitrary number of source streams. The strategy is to use a Data.Heap.Heap of Stream (Of a) m r, ordered by a. That is, bigMerge would have the signature

bigMerge :: (Monad m, Ord a) => [Stream (Of a) m r] -> Stream (Of a) m [r]

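(The list could also be replaced by a Heap.)

What I have is a very evil concoction that is not quite correct. Here it is.

For completeness, the imports first: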

import qualified Data.Heap as H
import Data.Heap (Heap)
import Data.List (sortBy)
import Data.Function (on)
import Streaming
import qualified Streaming.Prelude as S
import Streaming.Internal (Stream(..))  -- shouldn't!

To use a Heap, the elements need to be an instance of Ord:

data Elt a m r = Elt Int (Maybe a) (Stream(Of a) m r)

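The extra Int carries the index of the stream in the input list, so that the returned [r] can be built with its elements in the right order. The Maybe a carries the current value of the stream.

The Eq and Ord instances are: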

instance Eq a => Eq (Elt a m r) where
    (Elt i ma _) == (Elt i' ma' _) =
        if i == i' then error "Internal error: Index clash in =="
        else ma == ma'

instance Ord a => Ord (Elt a m r) where
    (Elt i ma s) <= (Elt i' ma' s')
        | i == i'   = error "Internal error: Index clash in <="
        | otherwise = cmp (i, ma, s) (i', ma', s')
      where
        cmp _                      (_, Nothing, Return _) = True
        cmp (_, Nothing, Return _) _                      = False
        cmp (i, Just a, _)         (i', Just a', _)       = if a == a' then i <= i' else a <= a'
        cmp (i, _, _)              (i', _, _)             = i <= i'

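Basically, anything is <= a Return, and all other cases use a and/or i to order the Elts. (The errors are there for debugging.)

Some helper functions create an Elt from a Stream, and a Heap from a list of Streams: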

eltFromStream :: (Monad m, Ord a) => Int -> Stream (Of a) m r -> m (Elt a m r)
eltFromStream i (Return r) = return $ Elt i Nothing (Return r)
eltFromStream i (Effect m) = do
    stream' <- m
    return $ Elt i Nothing stream'
eltFromStream i (Step (a :> rest)) = return $ Elt i (Just a) rest

heapFromStreams :: (Monad m, Ord a) => [Stream (Of a) m r] -> m (Heap (Elt a m r))
heapFromStreams strs = H.fromList <$> (sequence $ fmap (uncurry eltFromStream) (zip [0..] strs))

The core part is the loop function:

loop :: (Monad m, Ord a) => Heap (Elt a m r) -> m (Heap (Elt a m r))
loop h = do
    let (Elt i ma s, h') = unsafeUncons h
    elt <- case s of
        Return r         -> return $ Elt i Nothing (Return r)
        Effect m         -> Elt i Nothing <$> m
        Step (a :> rest) -> return $ Elt i (Just a) rest
    return $ H.insert elt h'

The cheeky unsafeUncons is:

unsafeUncons :: Heap a -> (a, Heap a)
unsafeUncons h = case H.uncons h of
    Nothing -> error "Internal error"
    Just x -> x

loop is used in the heapMerge function, which turns the Heap into a Stream:

heapMerge :: (Monad m, Ord a) => Heap (Elt a m r) -> Stream (Of a) m [r]
heapMerge h = case (ma,s) of
    (Nothing, Return _) -> Return $ getRs h
    (_, Effect m) -> error "TODO"
    (Just a, _)  -> do
        h' <- lift $ loop h
        Step (a :> heapMerge h')
    where
        Elt i ma s = H.minimum h

getRs just assembles the Return values into a list:

getRs :: (Monad m, Ord a) => Heap (Elt a m r) -> [r]
getRs h = snd <$> sortBy (compare `on` fst) (map f (H.toUnsortedList h))
  where
    f :: Monad m => Elt a m r -> (Int, r)
    f (Elt i _ (Return r)) = (i,r)
    f _ = error "Internal error: Call getR only after stream has finished!"

And then, finally,

bigMerge :: (Monad m, Ord a) => [Stream (Of a) m r] -> Stream (Of a) m [r]
bigMerge streams =
    if null streams then Return []
    else do
        h <- lift $ heapFromStreams streams
        heapMerge h

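This is convoluted, Effect is handled incorrectly, and it relies on the Return, Step and Effect constructors instead of inspect and next. It does produce correct results on simple inputs, for example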

s1 = S.each [2,4,5::Int]
s2 = S.each [1,2,4,5::Int]
s3 = S.each [3::Int]
S.print $ bigMerge [s1,s2,s3]

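I am sure there is a way to do this correctly and more idiomatically. For one thing, the Maybe a in Elt is probably redundant: I could make (Stream (Of a) m r) an instance of Ord directly, and if Effect is only pattern-matched, not executed, that should be fine. But then Stream (Of (Heap (Stream (Of a) m r, Int))) (Heap (Int,r)) looks weird. A "stream with an index", IStream a m r = IStream Int ((Heap (Stream (Of a) m r) deriving Functor, is a functor in r, so, with appropriate == and <=, would I be looking at Stream (IStream a m) m (Heap (Int, r))?

This functorial aspect of the streaming library is still a puzzle to me, so any help would be appreciated.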

1 answer:

Answer 0: (score: 2)

The signature of bigMerge looks a lot like the signature of sequenceA from Data.Traversable:

sequenceA :: Applicative f => [f r] -> f [r]

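Of course, the problem is that we cannot use the standard Applicative instance for Stream, because it concatenates rather than merges. But we can try to create our own instance through a newtype: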

{-# LANGUAGE DeriveFunctor #-}
import Streaming
import qualified Streaming.Prelude as S

newtype MergeStream a m r = 
    MergeStream { getMergeStream :: Stream (Of a) m r } deriving Functor

-- BEWARE! Only valid for ORDERED streams!
instance (Monad m, Ord a) => Applicative (MergeStream a m) where
    pure x = MergeStream (pure x)
    MergeStream f <*> MergeStream x = MergeStream (uncurry ($) <$> S.merge f x) 
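
To illustrate the point about the standard instance, here is a small sketch of what sequenceA does with the question's sample streams when no newtype is involved. The name concatenated is hypothetical and exists only for this illustration.

```haskell
import Streaming
import qualified Streaming.Prelude as S

-- Hypothetical helper, not part of the answer: sequence the question's
-- sample streams with the standard Applicative instance of Stream.
concatenated :: IO [Int]
concatenated = S.toList_ (sequenceA [s1, s2, s3])
  where
    s1 = S.each [2, 4, 5]
    s2 = S.each [1, 2, 4, 5]
    s3 = S.each [3]

-- ghci> concatenated
-- [2,4,5,1,2,4,5,3]
```

The streams are run one after another, so the elements come out in input order rather than sorted order.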

Now, using s1, s2 and s3 from the example, together with the standard traverse function from Traversable:

ghci> S.toList_ $ getMergeStream . traverse MergeStream $ [s1,s2,s3]
[1,2,2,3,4,4,5,5]

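This seems to work. That said, for efficiency reasons, your attempt at implementing bigMerge with Stream internals and a heap might still be worthwhile.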
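
For comparison, a heap-based merge does not have to touch Streaming.Internal. The following is only a sketch under a few assumptions: it uses S.next to advance a stream, the Entry type from Data.Heap (heaps package) so that only the key (element, source index) is compared, and an IntMap (swapped in here for the question's getRs) to collect the return values in input order. The name bigMergeNext and the local helpers pull and go are hypothetical.

```haskell
import Control.Monad (foldM)
import Data.Heap (Entry (..))
import qualified Data.Heap as H
import qualified Data.IntMap.Strict as IM
import Streaming
import qualified Streaming.Prelude as S

-- Sketch only: merge any number of ORDERED streams.
-- Each heap entry holds the current head of one stream, keyed by
-- (element, source index); the rest of that stream is the payload.
bigMergeNext :: (Monad m, Ord a) => [Stream (Of a) m r] -> Stream (Of a) m [r]
bigMergeNext streams = do
  (h, done) <- lift (foldM pull (H.empty, IM.empty) (zip [0 ..] streams))
  go h done
  where
    -- Advance stream i by one element with S.next. A finished stream
    -- records its return value; otherwise its head goes onto the heap.
    pull (h, done) (i, s) = do
      e <- S.next s
      pure $ case e of
        Left r          -> (h, IM.insert i r done)
        Right (a, rest) -> (H.insert (Entry (a, i) rest) h, done)

    -- Repeatedly emit the smallest head and refill from the same source.
    go h done = case H.uncons h of
      Nothing -> pure (IM.elems done) -- ascending key order = input order
      Just (Entry (a, i) rest, h') -> do
        S.yield a
        (h'', done') <- lift (pull (h', done) (i, rest))
        go h'' done'

-- ghci> S.toList_ $ bigMergeNext [s1,s2,s3]
-- [1,2,2,3,4,4,5,5]
```

Ties between equal elements are broken by the source index, so the merge is stable with respect to the order of the input list. Because S.next runs whatever effects precede the next element, Effect needs no special handling.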