Question

我正在使用向量库和状态monad在Haskell中编写一个最长的公共子序列算法（以封装Miller O(NP) algorithm的非常强制性和可变性）。我已经用C语言编写了一些我需要它的项目，现在我正在用Haskell编写它作为一种探索如何编写具有匹配C的良好性能的命令式网格遍历算法的方法。我用unboxed向量编写的版本对于相同的输入，它比C版慢大约4倍（并使用正确的优化标志编译 - 我使用系统时钟时间和Criterion方法来验证Haskell和C版本之间的相对时间测量，以及相同的数据类型，大小输入）。我一直试图弄清楚性能问题可能在哪里，并会欣赏反馈 - 有可能在这里遇到一些众所周知的性能问题，特别是在我在这里大量使用的矢量库中。

在我的代码中，我有一个名为gridWalk的函数，它最常被调用，并且还完成了大部分工作。性能下降很可能会出现，但我无法弄清楚它可能是什么。完整的Haskell代码为here。以下代码的片段：

import Data.Vector.Unboxed.Mutable as MU
import Data.Vector.Unboxed as U hiding (mapM_)
import Control.Monad.ST as ST
import Control.Monad.Primitive (PrimState)
import Control.Monad (when) 
import Data.STRef (newSTRef, modifySTRef, readSTRef)
import Data.Int


type MVI1 s  = MVector (PrimState (ST s)) Int

cmp :: U.Vector Int32 -> U.Vector Int32 -> Int -> Int -> Int
cmp a b i j = go 0 i j
               where
                 n = U.length a
                 m = U.length b
                 go !len !i !j| (i<n) && (j<m) && ((unsafeIndex a i) == (unsafeIndex b j)) = go (len+1) (i+1) (j+1)
                                    | otherwise = len

-- function to find previous y on diagonal k for furthest point 
findYP :: MVI1 s -> Int -> Int -> ST s (Int,Int)
findYP fp k offset = do
              let k0 = k+offset-1
                  k1 = k+offset+1
              y0 <- MU.unsafeRead fp k0 >>= \x -> return $ 1+x
              y1 <- MU.unsafeRead fp k1
              if y0 > y1 then return (k0,y0)
              else return (k1,y1)
{-#INLINE findYP #-}

gridWalk :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> ST s ()
gridWalk a b fp !k cmp = {-#SCC gridWalk #-} do
   let !offset = 1+U.length a
   (!kp,!yp) <- {-#SCC findYP #-} findYP fp k offset                          
   let xp = yp-k
       len = {-#SCC cmp #-} cmp a b xp yp
       x = xp+len
       y = yp+len

   {-#SCC "updateFP" #-} MU.unsafeWrite fp (k+offset) y  
   return ()
{-#INLINE gridWalk #-}

-- The function below executes ct times, and updates furthest point as they are found during furthest point search
findSnakes :: Vector Int32 -> Vector Int32 -> MVI1 s ->  Int -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> (Int -> Int -> Int) -> ST s ()
findSnakes a b fp !k !ct cmp op = {-#SCC findSnakes #-} U.forM_ (U.fromList [0..ct-1]) (\x -> gridWalk a b fp (op k x) cmp)
{-#INLINE findSnakes #-}

我添加了一些成本中心注释，并使用某个LCS输入进行分析以进行测试。这是我得到的：

  total time  =        2.39 secs   (2394 ticks @ 1000 us, 1 processor)
  total alloc = 4,612,756,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc

gridWalk    Main       67.5   52.7
findSnakes  Main       23.2   27.8
cmp         Main        4.2    0.0
findYP      Main        3.5   19.4
updateFP    Main        1.6    0.0


                                                         individual     inherited
COST CENTRE    MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN           MAIN                     64           0    0.0    0.0   100.0  100.0
 main          Main                    129           0    0.0    0.0     0.0    0.0
 CAF           Main                    127           0    0.0    0.0   100.0  100.0
  findSnakes   Main                    141           0    0.0    0.0     0.0    0.0
  main         Main                    128           1    0.0    0.0   100.0  100.0
   findSnakes  Main                    138           0    0.0    0.0     0.0    0.0
    gridWalk   Main                    139           0    0.0    0.0     0.0    0.0
     cmp       Main                    140           0    0.0    0.0     0.0    0.0
   while       Main                    132        4001    0.1    0.0   100.0  100.0
    findSnakes Main                    133       12000   23.2   27.8    99.9   99.9
     gridWalk  Main                    134    16004000   67.5   52.7    76.7   72.2
      cmp      Main                    137    16004000    4.2    0.0     4.2    0.0
      updateFP Main                    136    16004000    1.6    0.0     1.6    0.0
      findYP   Main                    135    16004000    3.5   19.4     3.5   19.4
   newVI1      Main                    130           1    0.0    0.0     0.0    0.0
   newVI1.\   Main                    131        8004    0.0    0.0     0.0    0.0
 CAF           GHC.Conc.Signal         112           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Encoding         104           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Encoding.Iconv   102           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Handle.FD         95           0    0.0    0.0     0.0    0.0

如果我正确解释分析输出（并假设分析没有太多失真），gridWalk占用大部分时间，但主要功能cmp和{{1}在findYP中进行繁重的工作，似乎在分析报告中花费的时间很少。那么，也许瓶颈在gridWalk封装器forM_函数用来调用findSnakes？堆配置文件看起来也很干净： Heap profile

阅读核心，没有什么能真正跳出来。我认为内部循环中的某些值可能是盒装的，但我没有在核心中发现它们。我希望性能问题是由于我错过了一些简单的事情。

更新

根据@DanielFischer的建议，我将gridWalk的{{1}}替换为forM_ Data.Vector.Unboxed Control.Monad的{{1}}，将性能从4x提高到2.5x的C版本。如果你想试用它们，Haskell和C版本现在发布here。

我仍在挖掘核心，看看瓶颈在哪里。 findSnakes最常被称为函数，为了使其表现良好，gridWalk应该将lcsh循环减少到条件检查和内联whileM_代码的良好迭代内循环。我怀疑在汇编时，findSnakes循环不是这种情况，但由于我对于转换核心并在汇编中找到名称错误的GHC函数不是很了解，我想这只是耐心堵塞的问题解决问题，直到我搞清楚。同时，如果有关于性能修复的任何指示，他们将不胜感激。

我能想到的另一种可能性是函数调用期间堆检查的开销。如分析报告中所示，whileM_被称为16004000次。假设有6个循环进行堆检查（我猜它少了，但我们还是假设它），在3.33GHz的盒子上为96024000个循环时间为~0.02秒。

此外，还有一些表现数字：

gridWalk：在Haskell code (GHC 7.6.1 x86_64)修复之前约为0.25秒。

forM_

time ./T 1 real 0m0.150s user 0m0.145s sys 0m0.003s：

C code (gcc 4.7.2 x86_64)

更新2：

更新的代码为here。使用time ./test 1 real 0m0.065s user 0m0.063s sys 0m0.000s也不会更改数字。 STUArray的性能约为1.5倍，与@DanielFischer在Linux上的报告非常相似。

Haskell代码：

Mac OS X (x86_64,ghc7.6.1)

C代码：

$ time ./Diff
1

real    0m0.087s
user    0m0.084s
sys 0m0.003s

纵观$ time ./test 1 real 0m0.056s user 0m0.053s sys 0m0.002s，调用是尾递归的，并由cmm转换为循环。但是每个新的迭代似乎都会分配新的值来调用堆检查，因此，可以解释性能上的差异。我必须考虑如何以这样的方式编写尾递归，这样就不会在迭代中分配任何值，从而避免堆检查和分配开销。

Answer 1

你受到了巨大打击

U.forM_ (U.fromList [0..ct-1])

findSnakes中的

。我确信不应该发生（票证？），但是每次调用Vector时都会分配一个新的findSnakes来遍历。如果你使用

Control.Monad.forM_ [0 .. ct-1]

相反，运行时间大致减半，这里的分配下降了大约500倍。（GHC优化C.M.forM_ [0 :: Int .. limit]，列表被删除，剩下的基本上是一个循环。）你可以通过自己编写循环来更好地稍微。

导致无端分配/代码大小膨胀而不会损害性能的一些事情

Bool

lcsh

cmp参数findSnakes和gridWalk;如果从未使用与顶级cmp不同的比较来调用它们，则该参数会导致不必要的代码重复。
while的一般类型;将它专门化为使用的类型ST s Bool -> ST s () -> ST s ()会减少分配（很多），也会减少运行时间（稍微但很明显，这里）。

关于性能分析的一般说法：编译用于性能分析的程序会抑制许多优化。特别是对于大量使用融合的vector，bytestring或text等库，分析通常会产生误导性的结果。

例如，您的原始代码在此生成

    total time  =        3.42 secs   (3415 ticks @ 1000 us, 1 processor)
    total alloc = 4,612,756,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc  ticks     bytes

gridWalk    Main       63.7   52.7   2176 2432608000
findSnakes  Main       20.0   27.8    682 1281440080
cmp         Main        9.2    0.0    313        16
findYP      Main        4.2   19.4    144 896224000
updateFP    Main        2.7    0.0     91         0

在len中添加gridWalk的绑定只会在非分析版本中发生任何变化，但对于分析版本

    total time  =        2.98 secs   (2985 ticks @ 1000 us, 1 processor)
    total alloc = 3,204,404,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc  ticks     bytes

gridWalk    Main       63.0   32.0   1881 1024256000
findSnakes  Main       22.2   40.0    663 1281440080
cmp         Main        7.2    0.0    214        16
findYP      Main        4.7   28.0    140 896224000
updateFP    Main        2.7    0.0     82         0

它有很大的不同。对于包含上述更改的版本（以及len中的gridWalk上的爆炸），分析版本说明

total alloc = 1,923,412,776 bytes  (excludes profiling overheads)

但非分析版

     1,814,424 bytes allocated in the heap
        10,808 bytes copied during GC
        49,064 bytes maximum residency (2 sample(s))
        25,912 bytes maximum slop
             1 MB total memory in use (0 MB lost due to fragmentation)

                                  Tot time (elapsed)  Avg pause  Max pause
Gen  0         2 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.12s  (  0.12s elapsed)
GC      time    0.00s  (  0.00s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.12s  (  0.12s elapsed)

表示它的分配比分析版本少1000倍。

对于vector和朋友代码，识别瓶颈比分析更可靠（不幸的是，更加耗时和困难）正在研究生成的核心（或者汇编，如果你精通阅读）。

关于更新，我的盒子上的C运行速度稍慢（gcc-4.7.2，-O3）

$ time ./miltest1

real    0m0.074s
user    0m0.073s
sys     0m0.001s

但Haskell大致相同

$ time ./hsmiller
1

real    0m0.151s
user    0m0.149s
sys     0m0.001s

通过LLVM后端进行编译时，速度会快一些：

$ time ./hsmiller1

real    0m0.131s
user    0m0.129s
sys     0m0.001s

当我们用手动循环替换forM_时，

findSnakes a b fp !k !ct op = go 0
  where
    go x
        | x < ct    = gridWalk a b fp (op k x) >> go (x+1)
        | otherwise = return ()

它变得快一点，

$ time ./hsmiller
1

real    0m0.124s
user    0m0.121s
sys     0m0.002s

RESP。通过LLVM：

$ time ./hsmiller
1

real    0m0.108s
user    0m0.107s
sys     0m0.000s

总的来说，生成的核心看起来很好，一个小烦恼就是

Main.$wa
  :: forall s.
     GHC.Prim.Int#
     -> GHC.Types.Int
     -> GHC.Prim.State# s
     -> (# GHC.Prim.State# s, Main.MVI1 s #)

并略微迂回实施。通过在第二个参数中使newVI1严格，

来解决这个问题

newVI1 n !x = do

由于不经常调用，因此对性能的影响当然可以忽略不计。

肉是lcsh的核心，看起来并不太糟糕。唯一的盒装内容是Int读取/写入STRef，这是不可避免的。不太令人愉快的是，核心包含大量的代码重复，但根据我的经验，这很少是一个真正的性能问题，并不是所有重复的代码都能在代码生成中存活下来。

为了使其表现良好，lcsh应该将whileM_循环减少到条件检查和内联findSnakes代码的良好迭代内循环。

当你向INLINE添加whileM_ pragma时，你会得到一个内部循环，但是这个循环并不好，在这种情况下，它比whileM_更慢-of-line（我不确定它是否仅仅是由于代码大小，但它可能是）。

调试最长公共子序列算法的性能瓶颈

1 个答案: