Question

这两个程序做同样的事情，但是一个程序运行速度快10倍。

这大约需要在我的机器上10秒钟：

import Control.Monad
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

theValueOne=B.singleton 1

main = replicateM_ 100000000 $ B.putStr theValueOne

第二个版本使用输出延迟IO。它大约在1秒钟内完成（与c一样快）：

import qualified Data.ByteString.Lazy as BL

main = BL.putStr $ BL.pack $ replicate 100000000 1

问题：为什么非惰性版本这么慢？更重要的是，我该如何快速？（我尝试过递归forM，使用hSetBuffering修改输出缓冲区...没什么改变）

注意：这不仅仅是一个学术问题。非惰性版本是我公司在生产中使用的可执行文件的极度简化版本，它的运行速度也很慢。围绕类似的惰性解决方案重新构造大型程序几乎是不可能的。

Answer 1

已更新：添加了可能的问题来源和解决方案。

我认为这与惰性I / O没有任何关系。如果重写严格的I / O版本以一次写入两个字节：

theValueOne = B.singleton 1
main = replicateM_ 50000000 $ B.putStr (theValueOne <> theValueOne)

将时间减半。一次写入十个字节：

theValueOne = B.singleton 1
main = replicateM_ 10000000 $ B.putStr (foldMap id (replicate 10 theValueOne))

，它已经比惰性I / O版本要快。

问题是B.hPutStr调用中有一些开销，比C fwrite调用中的开销要多得多，而且这不是写单个字节的特别有效的方法。

很大一部分开销来自Haskell I / O缓冲区具有不变的元数据这一事实。即使缓冲区 content 本身是可变的，但指向缓冲区are immutable中有效数据的指针，因此写入单个字节需要对新的GHC.IO.Buffer.Buffer结构进行堆分配，该结构GHC无法优化

一种解决方案是使用带有可变指针的手工缓冲结构。下面的方法有效，其速度是原始问题中的惰性I / O版本的两倍。

{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}

import Control.Monad
import Data.IORef
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO

data WriteBuffer = WriteBuffer
  { handle :: !Handle
  , capacity :: !Int
  , used :: !(IORef Int)
  , content :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  WriteBuffer h cap <$> newIORef 0 <*> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO ()
flushBuffer WriteBuffer{..} = do
  n <- readIORef used
  withForeignPtr content $ \p -> hPutBuf handle p n
  writeIORef used 0

writeByte :: Word8 -> WriteBuffer -> IO ()
writeByte w buf@(WriteBuffer{..}) = do
  n <- readIORef used
  withForeignPtr content $ \p -> poke (plusPtr p n) w
  let n' = n + 1
  writeIORef used n'
  when (n' == capacity) $
    flushBuffer buf

main :: IO ()
main = do
  b <- newBuffer stdout
  replicateM_ 100000000 (writeByte 1 b)
  flushBuffer b

具有讽刺意味的是，使用不可变计数器将其转换为版本并通过WriteBuffer作为状态传递foldM会使速度再次加倍，因此它的速度约为惰性I / O版本的4倍在原始问题中：

{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}

import Control.Monad
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO

data WriteBuffer = WriteBuffer
  { handle :: !Handle
  , capacity :: !Int
  , used :: !Int
  , content :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  WriteBuffer h cap 0 <$> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO WriteBuffer
flushBuffer buf@WriteBuffer{..} = do
  withForeignPtr content $ \p -> hPutBuf handle p used
  return $ buf { used = 0 }

writeByte :: Word8 -> WriteBuffer -> IO WriteBuffer
writeByte w buf@(WriteBuffer{..}) = do
  withForeignPtr content $ \p -> poke (plusPtr p used) w
  let used' = used + 1
      buf' = buf { used = used' }
  if (used' == capacity)
    then flushBuffer buf'
    else return buf'

main :: IO ()
main = do
  b <- newBuffer stdout
  b' <- foldM (\s _ -> writeByte 1 s) b [(1::Int)..100000000]
  void (flushBuffer b')

这个速度之所以如此之快似乎是因为GHC能够完全从折叠中优化WriteBuffer构造函数，并且只需在循环中传递未装箱的指针和整数。我的猜测是，如果我修改了上面的可变版本，以避免对used IORef中的整数进行装箱和拆箱，那么它的速度将类似。

在GHC中，为什么这个小程序的惰性版本比基于循环的变体快得多？

1 个答案: