我正在尝试在Haskell内快速生成随机数据,但是当我尝试使用任何惯用方法时,我会得到低速和大GC开销。
这是简短的代码:
import qualified System.Random.Mersenne as RM
import qualified Data.ByteString.Lazy as BL
import qualified System.IO as SI
import Data.Word
main = do
r <- RM.newMTGen Nothing :: IO RM.MTGen
rnd <- RM.randoms r :: IO [Word8]
BL.hPutStr SI.stdout $ BL.pack rnd
以下是快速代码:
import qualified System.Random.Mersenne as RM
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Put as DBP
import qualified System.IO as SI
import Data.List
import Control.Monad (void, forever)
import Data.Word
main = do
r <- RM.newMTGen Nothing :: IO RM.MTGen
forever $ do
x0 <- RM.random r :: IO Word32
x1 <- RM.random r :: IO Word32
x2 <- RM.random r :: IO Word32
x3 <- RM.random r :: IO Word32
x4 <- RM.random r :: IO Word32
x5 <- RM.random r :: IO Word32
x6 <- RM.random r :: IO Word32
x7 <- RM.random r :: IO Word32
x8 <- RM.random r :: IO Word32
x9 <- RM.random r :: IO Word32
xA <- RM.random r :: IO Word32
xB <- RM.random r :: IO Word32
xC <- RM.random r :: IO Word32
xD <- RM.random r :: IO Word32
xE <- RM.random r :: IO Word32
xF <- RM.random r :: IO Word32
c0 <- RM.random r :: IO Word32
c1 <- RM.random r :: IO Word32
c2 <- RM.random r :: IO Word32
c3 <- RM.random r :: IO Word32
c4 <- RM.random r :: IO Word32
c5 <- RM.random r :: IO Word32
c6 <- RM.random r :: IO Word32
c7 <- RM.random r :: IO Word32
c8 <- RM.random r :: IO Word32
c9 <- RM.random r :: IO Word32
cA <- RM.random r :: IO Word32
cB <- RM.random r :: IO Word32
cC <- RM.random r :: IO Word32
cD <- RM.random r :: IO Word32
cE <- RM.random r :: IO Word32
cF <- RM.random r :: IO Word32
v0 <- RM.random r :: IO Word32
v1 <- RM.random r :: IO Word32
v2 <- RM.random r :: IO Word32
v3 <- RM.random r :: IO Word32
v4 <- RM.random r :: IO Word32
v5 <- RM.random r :: IO Word32
v6 <- RM.random r :: IO Word32
v7 <- RM.random r :: IO Word32
v8 <- RM.random r :: IO Word32
v9 <- RM.random r :: IO Word32
vA <- RM.random r :: IO Word32
vB <- RM.random r :: IO Word32
vC <- RM.random r :: IO Word32
vD <- RM.random r :: IO Word32
vE <- RM.random r :: IO Word32
vF <- RM.random r :: IO Word32
b0 <- RM.random r :: IO Word32
b1 <- RM.random r :: IO Word32
b2 <- RM.random r :: IO Word32
b3 <- RM.random r :: IO Word32
b4 <- RM.random r :: IO Word32
b5 <- RM.random r :: IO Word32
b6 <- RM.random r :: IO Word32
b7 <- RM.random r :: IO Word32
b8 <- RM.random r :: IO Word32
b9 <- RM.random r :: IO Word32
bA <- RM.random r :: IO Word32
bB <- RM.random r :: IO Word32
bC <- RM.random r :: IO Word32
bD <- RM.random r :: IO Word32
bE <- RM.random r :: IO Word32
bF <- RM.random r :: IO Word32
BL.hPutStr SI.stdout $ DBP.runPut $ do
DBP.putWord32be x0
DBP.putWord32be x1
DBP.putWord32be x2
DBP.putWord32be x3
DBP.putWord32be x4
DBP.putWord32be x5
DBP.putWord32be x6
DBP.putWord32be x7
DBP.putWord32be x8
DBP.putWord32be x9
DBP.putWord32be xA
DBP.putWord32be xB
DBP.putWord32be xC
DBP.putWord32be xD
DBP.putWord32be xE
DBP.putWord32be xF
DBP.putWord32be c0
DBP.putWord32be c1
DBP.putWord32be c2
DBP.putWord32be c3
DBP.putWord32be c4
DBP.putWord32be c5
DBP.putWord32be c6
DBP.putWord32be c7
DBP.putWord32be c8
DBP.putWord32be c9
DBP.putWord32be cA
DBP.putWord32be cB
DBP.putWord32be cC
DBP.putWord32be cD
DBP.putWord32be cE
DBP.putWord32be cF
DBP.putWord32be v0
DBP.putWord32be v1
DBP.putWord32be v2
DBP.putWord32be v3
DBP.putWord32be v4
DBP.putWord32be v5
DBP.putWord32be v6
DBP.putWord32be v7
DBP.putWord32be v8
DBP.putWord32be v9
DBP.putWord32be vA
DBP.putWord32be vB
DBP.putWord32be vC
DBP.putWord32be vD
DBP.putWord32be vE
DBP.putWord32be vF
DBP.putWord32be b0
DBP.putWord32be b1
DBP.putWord32be b2
DBP.putWord32be b3
DBP.putWord32be b4
DBP.putWord32be b5
DBP.putWord32be b6
DBP.putWord32be b7
DBP.putWord32be b8
DBP.putWord32be b9
DBP.putWord32be bA
DBP.putWord32be bB
DBP.putWord32be bC
DBP.putWord32be bD
DBP.putWord32be bE
DBP.putWord32be bF
短代码在我的计算机上每秒输出大约6兆字节的随机字节。 快速代码 - 每秒约150兆字节。
如果我在快速代码中将该变量的数量从64减少到16,则速度降至每秒78兆字节。
如何在不减慢速度的情况下使这段代码变得紧凑和惯用?
答案 0 :(得分:9)
我不认为懒惰的IO在Haskell中被认为是非常惯用的。它可能适用于单行,但对于大型IO密集型程序,haskellers使用iteratees / conduit / pipes / Oleg-knows-what。
首先,要创建一个参考点,在Linux x86-64上使用GHC 7.6.3(-O2 --make
)编译在我的计算机上运行原始版本的一些统计信息。慢懒字节字符串版本:
$ ./rnd +RTS -s | pv | head -c 100M > /dev/null
100MB 0:00:09 [10,4MB/s] [ <=> ]
6,843,934,360 bytes allocated in the heap
2,065,144 bytes copied during GC
68,000 bytes maximum residency (2 sample(s))
18,016 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
...
Productivity 99.2% of total user, 97.7% of total elapsed
它不是非常快,但没有GC和内存开销可言。有趣的是,使用此代码可以获得37%的GC时间。
具有展开循环的快速版本:
$ ./rndfast +RTS -s | pv | head -c 500M > /dev/null
500MB 0:00:04 [ 110MB/s] [ <=> ]
69,434,953,224 bytes allocated in the heap
9,225,128 bytes copied during GC
68,000 bytes maximum residency (2 sample(s))
18,016 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
...
Productivity 85.0% of total user, 72.7% of total elapsed
这要快得多,但有趣的是,现在我们有了15%的GC开销。
最后,我的版本使用了导管和blaze-builder。它一次生成512个随机Word64
,以生成4 KB数据块以供下游使用。随着我将列表“缓冲区”大小从32增加到512,性能稳步提高,但改进小于128。
import Blaze.ByteString.Builder (Builder)
import Blaze.ByteString.Builder.Word
import Control.Monad (forever)
import Control.Monad.IO.Class (liftIO)
import Data.ByteString (ByteString)
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import Data.Conduit.Blaze (builderToByteString)
import Data.Word
import System.IO (stdout)
import qualified System.Random.Mersenne as RM
randomStream :: RM.MTGen -> Source IO Builder
randomStream gen = forever $ do
words <- liftIO $ RM.randoms gen
yield $ fromWord64shost $ take 512 words
main :: IO ()
main = do
gen <- RM.newMTGen Nothing
randomStream gen $= builderToByteString $$ CB.sinkHandle stdout
我注意到,与上面的两个程序不同,当使用-fllvm
编译时,它稍微快(3-4%),因此下面的输出来自LLVM 3.3生成的二进制文件。
$ ./rndconduit +RTS -s | pv | head -c 500M > /dev/null
500MB 0:00:09 [53,2MB/s] [ <=> ]
8,889,236,736 bytes allocated in the heap
10,912,024 bytes copied during GC
36,376 bytes maximum residency (2 sample(s))
19,024 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
...
Productivity 99.0% of total user, 91.9% of total elapsed
因此,它的速度是手动展开版本的两倍,但几乎与懒惰的IO版本一样短且可读,几乎没有GC开销和可预测的内存行为。也许这里有改进的余地:欢迎提出意见。
更新:
结合一些不安全的字节摆弄导管我能够制作生成300+ MB / s随机数据的程序。看起来简单的类型专用尾递归函数比惰性列表和手动展开都更好。
import Control.Monad (forever)
import Control.Monad.IO.Class (liftIO)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import Data.Word
import Foreign.Marshal.Array
import Foreign.Ptr
import Foreign.Storable
import System.IO (stdout)
import qualified System.Random.Mersenne as RM
randomChunk :: RM.MTGen -> Int -> IO ByteString
randomChunk gen bufsize = allocaArray bufsize $ \ptr -> do
loop ptr bufsize
B.packCStringLen (castPtr ptr, bufsize * sizeOf (undefined :: Word64))
where
loop :: Ptr Word64 -> Int -> IO ()
loop ptr 0 = return ()
loop ptr n = do
x <- RM.random gen
pokeElemOff ptr n x
loop ptr (n - 1)
chunkStream :: RM.MTGen -> Source IO ByteString
chunkStream gen = forever $ liftIO (randomChunk gen 512) >>= yield
main :: IO ()
main = do
gen <- RM.newMTGen Nothing
chunkStream gen $$ CB.sinkHandle stdout
在此速度下,IO开销实际上变得明显:程序在系统调用中花费超过四分之一的运行时间,并且像上面示例中那样将head
添加到管道会大大减慢它。
$ ./rndcond +RTS -s | pv > /dev/null
^C27GB 0:00:10 [ 338MB/s] [ <=> ]
8,708,628,512 bytes allocated in the heap
1,646,536 bytes copied during GC
36,168 bytes maximum residency (2 sample(s))
17,080 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
...
Productivity 98.7% of total user, 73.6% of total elapsed
答案 1 :(得分:2)
我可以确认第二个版本比第一个版本慢,但程度不同。在10秒内,短代码生成111M的数据,而大代码生成833M的数据。这是在Mac OSX Lion上完成的,使用7.6.3和-O3编译。
虽然我不知道为什么第一个解决方案太慢,但使用replicateM
和mapM
来删除重复可以简化第二个解决方案:
main3 = do
r <- RM.newMTGen Nothing :: IO RM.MTGen
forever $ do
vals <- sequence $ replicate 64 (RM.random r)
BL.hPutStr SI.stdout $ DBP.runPut $ mapM_ DBP.putWord32be vals
此解决方案仍然较慢,在10秒内生成492M的数据。最后的最后一个解决方案是使用模板haskell,来生成展开循环的代码:
main4 = do
r <- RM.newMTGen Nothing :: IO RM.MTGen
forever $ do
$(let varCount = 64
-- | replaces every instance of oldName with newName in the exp
replaceNames :: (Typeable t, Data t) => String -> Name -> t -> t
replaceNames oldName replacementName expr = everywhere (mkT changeName) expr where
changeName name | nameBase name == oldName = replacementName
| otherwise = name
singleVarExp :: Name -> ExpQ -> ExpQ
singleVarExp varName next = replaceNames "patternvar" varName <$> [| RM.random r >>= \patternvar -> $(next) |]
allVarExps :: [Name] -> ExpQ -> ExpQ
allVarExps (n:ns) next = foldr (\var result -> singleVarExp var result)
(singleVarExp n next) ns
singleOutputter :: Name -> ExpQ -> ExpQ
singleOutputter varName next = [| DBP.putWord32be $(varE varName) >> $(next) |]
allVarOutput :: [Name] -> ExpQ
allVarOutput (n:ns) = foldr (\var result -> singleOutputter var result)
(singleOutputter n [| return () |]) ns
printResultExp :: [Name] -> ExpQ
printResultExp names = [| BL.hPutStr SI.stdout $ DBP.runPut ($(allVarOutput names)) |]
result = do
vars <- replicateM varCount $ newName "x"
allVarExps vars (printResultExp vars)
in result)
此速度与原始快速版本大致相同。它不是很整洁(您的快速解决方案更容易阅读),但您现在可以轻松地更改变量的数量,并且仍然可以展开循环。我尝试了512,但除了使编译时间变大之外,它似乎对性能没有太大影响。