Question

我有一个应用程序花费大约80％的时间使用Kahan summation algorithm计算大型列表（10 ^ 7）的高维向量（dim = 100）的质心。我已经尽力优化求和，但它仍然比同等的C实现慢20倍。分析表明罪犯是来自unsafeRead的{{1}}和unsafeWrite函数。我的问题是：这些功能真的很慢还是我误解了性能分析统计数据？

以下是两个实现。使用llvm后端使用ghc-7.0.3编译Haskell。 C one用llvm-gcc编译。

哈斯克尔的卡汉总结：

Data.Vector.Unboxed.Mutable

使用llvm后端使用ghc-7.0.3进行编译：

{-# LANGUAGE BangPatterns #-}
module Test where

import Control.Monad ( mapM_ )
import Data.Vector.Unboxed ( Vector, Unbox )
import Data.Vector.Unboxed.Mutable ( MVector )
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM
import Data.Word ( Word )
import Data.Bits ( shiftL, shiftR, xor )

prng :: Word -> Word
prng w = w' where
    !w1 = w  `xor` (w  `shiftL` 13)
    !w2 = w1 `xor` (w1 `shiftR` 7)
    !w' = w2 `xor` (w2 `shiftL` 17)

mkVect :: Word -> Vector Double
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng

foldV :: (Unbox a, Unbox b) 
      => (a -> b -> a) -- componentwise function to fold
      -> Vector a      -- initial accumulator value
      -> [Vector b]    -- data vectors
      -> Vector a      -- final accumulator value
foldV fn accum vs = U.modify (\x -> mapM_ (liftV fn x) vs) accum where
    liftV f acc = fV where
        fV v = go 0 where
            n = min (U.length v) (UM.length acc)
            go i | i < n     = step >> go (i + 1)
                 | otherwise = return ()
                 where
                     step = {-# SCC "fV_step" #-} do
                         a <- {-# SCC "fV_read"  #-} UM.unsafeRead acc i
                         b <- {-# SCC "fV_index" #-} U.unsafeIndexM v i
                         {-# SCC "fV_write" #-} UM.unsafeWrite acc i $! {-# SCC "fV_apply" #-} f a b

kahan :: [Vector Double] -> Vector Double
kahan [] = U.singleton 0.0
kahan (v:vs) = fst . U.unzip $ foldV kahanStep acc vs where
    acc = U.map (\z -> (z, 0.0)) v

kahanStep :: (Double, Double) -> Double -> (Double, Double)
kahanStep (s, c) x = (s', c') where
    !y  = x - c
    !s' = s + y
    !c' = (s' - s) - y
{-# NOINLINE kahanStep #-}

zero :: U.Vector Double
zero = U.replicate 100 0.0

myLoop n = kahan $ map mkVect [1..n]

main = print $ myLoop 100000

分析信息：

ghc -o Test_hs --make -fforce-recomp -O3 -fllvm -optlo-O3 -msse2 -main-is Test.main Test.hs

time ./Test_hs
real    0m1.948s
user    0m1.936s
sys     0m0.008s

memory residency by cost center memory residency by type for <code>foldV</code>

C中的等效实现：

16,710,594,992 bytes allocated in the heap
      33,047,064 bytes copied during GC
          35,464 bytes maximum residency (1 sample(s))
          23,888 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 31907 collections,     0 parallel,  0.28s,  0.27s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   24.73s  ( 24.74s elapsed)
  GC    time    0.28s  (  0.27s elapsed)
  RP    time    0.00s  (  0.00s elapsed)
  PROF  time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   25.01s  ( 25.02s elapsed)

  %GC time       1.1%  (1.1% elapsed)

  Alloc rate    675,607,179 bytes per MUT second

  Productivity  98.9% of total user, 98.9% of total elapsed

    Thu Feb 23 02:42 2012 Time and Allocation Profiling Report  (Final)

       Test_hs +RTS -s -p -RTS

    total time  =       24.60 secs   (1230 ticks @ 20 ms)
    total alloc = 8,608,188,392 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

fV_write                       Test                  31.1   26.0
fV_read                        Test                  27.2   23.2
mkVect                         Test                  12.3   27.2
fV_step                        Test                  11.7    0.0
foldV                          Test                   5.9    5.7
fV_index                       Test                   5.2    9.3
kahanStep                      Test                   3.3    6.5
prng                           Test                   2.2    1.8


                                                                                               individual    inherited
COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc

MAIN                     MAIN                                                   1           0   0.0    0.0   100.0  100.0
 CAF:main1               Test                                                 339           1   0.0    0.0     0.0    0.0
  main                   Test                                                 346           1   0.0    0.0     0.0    0.0
 CAF:main2               Test                                                 338           1   0.0    0.0   100.0  100.0
  main                   Test                                                 347           0   0.0    0.0   100.0  100.0
   myLoop                Test                                                 348           1   0.2    0.2   100.0  100.0
    mkVect               Test                                                 350      400000  12.3   27.2    14.5   29.0
     prng                Test                                                 351     9900000   2.2    1.8     2.2    1.8
    kahan                Test                                                 349         102   0.0    0.0    85.4   70.7
     foldV               Test                                                 359           1   5.9    5.7    85.4   70.7
      fV_step            Test                                                 360     9999900  11.7    0.0    79.5   65.1
       fV_write          Test                                                 367    19999800  31.1   26.0    35.4   32.5
        fV_apply         Test                                                 368     9999900   1.0    0.0     4.3    6.5
         kahanStep       Test                                                 369     9999900   3.3    6.5     3.3    6.5
       fV_index          Test                                                 366     9999900   5.2    9.3     5.2    9.3
       fV_read           Test                                                 361     9999900  27.2   23.2    27.2   23.2
 CAF:lvl19_r3ei          Test                                                 337           1   0.0    0.0     0.0    0.0
  kahan                  Test                                                 358           0   0.0    0.0     0.0    0.0
 CAF:poly_$dPrimMonad3_r3eg Test                                                 336           1   0.0    0.0     0.0    0.0
  kahan                  Test                                                 357           0   0.0    0.0     0.0    0.0
 CAF:$dMVector2_r3ee     Test                                                 335           1   0.0    0.0     0.0    0.0
 CAF:$dVector1_r3ec      Test                                                 334           1   0.0    0.0     0.0    0.0
 CAF:poly_$dMonad_r3ea   Test                                                 333           1   0.0    0.0     0.0    0.0
 CAF:$dMVector1_r3e2     Test                                                 330           1   0.0    0.0     0.0    0.0
 CAF:poly_$dPrimMonad2_r3e0 Test                                                 328           1   0.0    0.0     0.0    0.0
  foldV                  Test                                                 365           0   0.0    0.0     0.0    0.0
 CAF:lvl11_r3dM          Test                                                 322           1   0.0    0.0     0.0    0.0
  kahan                  Test                                                 354           0   0.0    0.0     0.0    0.0
 CAF:lvl10_r3dK          Test                                                 321           1   0.0    0.0     0.0    0.0
  kahan                  Test                                                 355           0   0.0    0.0     0.0    0.0
 CAF:$dMVector_r3dI      Test                                                 320           1   0.0    0.0     0.0    0.0
  kahan                  Test                                                 356           0   0.0    0.0     0.0    0.0
 CAF                     GHC.Float                                            297           1   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.Handle.FD                                     256           2   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.Encoding.Iconv                                214           2   0.0    0.0     0.0    0.0
 CAF                     GHC.Conc.Signal                                      211           1   0.0    0.0     0.0    0.0
 CAF                     Data.Vector.Generic                                  182           1   0.0    0.0     0.0    0.0
 CAF                     Data.Vector.Unboxed                                  174           2   0.0    0.0     0.0    0.0

使用llvm-gcc编译：

#include <stdint.h>
#include <stdio.h>


#define VDIM    100
#define VNUM    100000



uint64_t prng (uint64_t w) {
    w ^= w << 13;
    w ^= w >> 7;
    w ^= w << 17;
    return w;
};

void kahanStep (double *s, double *c, double x) {
    double y, t;
    y  = x - *c;
    t  = *s + y;
    *c = (t - *s) - y;
    *s = t;
}

void kahan(double s[], double c[]) {
    for (int i = 1; i <= VNUM; i++) {
        uint64_t w = i;
        for (int j = 0; j < VDIM; j++) {
                kahanStep(&s[j], &c[j], w);
                w = prng(w);
        }
    }
};


int main (int argc, char* argv[]) {
    double acc[VDIM], err[VDIM];
    for (int i = 0; i < VDIM; i++) {
        acc[i] = err[i] = 0.0;
    };
    kahan(acc, err);
    printf("[ ");
    for (int i = 0; i < VDIM; i++) {
        printf("%g ", acc[i]);
    };
    printf("]\n");
};

更新1：我在C版本中未内联>llvm-gcc -o Test_c -O3 -msse2 -std=c99 test.c >time ./Test_c real 0m0.096s user 0m0.088s sys 0m0.004s。它几乎没有削弱性能。我希望现在我们都能承认阿姆达尔的法律并继续前进。如效率低kahanStep，kahanStep和unsafeRead慢9-10倍。我希望有人可以对这一事实的可能原因有所了解。

此外，我应该说，因为我正在与使用unsafeWrite的图书馆进行交互，所以我现在有点嫁给它，并且与它分开将是非常痛苦的： - ）

更新2：我想我原来的问题不够清楚。我不是在寻找加速这个微基准测试的方法。我正在寻找计数器直观性能分析统计数据的解释，因此我可以决定是否针对Data.Vector.Unboxed提交错误报告。

Answer 1

您的C版本不等同于您的Haskell实现。在C中，您自己在Kaaskell中总结了重要的Kahan求和步骤，您在Haskell中创建了一个多态高阶函数，它可以执行更多操作并将转换步骤作为参数。将kahanStep移动到C中的单独函数不是该点，它仍将由编译器内联。即使你将它放入自己的源文件中，单独编译并链接而没有链接时优化，你只能解决部分差异。

我已经制作了一个更接近Haskell版本的C版本，

kahan.h：

typedef struct DPair_st {
    double fst, snd;
    } DPair;

DPair kahanStep(DPair pr, double x);

kahanStep.c：

#include "kahan.h"

DPair kahanStep (DPair pr, double x) {
    double y, t;
    y  = x - pr.snd;
    t  = pr.fst + y;
    pr.snd = (t - pr.fst) - y;
    pr.fst = t;
    return pr;
}

main.c中：

#include <stdint.h>
#include <stdio.h>
#include "kahan.h"


#define VDIM    100
#define VNUM    100000

uint64_t prng (uint64_t w) {
    w ^= w << 13;
    w ^= w >> 7;
    w ^= w << 17;
    return w;
};

void kahan(double s[], double c[], DPair (*fun)(DPair,double)) {
    for (int i = 1; i <= VNUM; i++) {
        uint64_t w = i;
        for (int j = 0; j < VDIM; j++) {
            DPair pr;
            pr.fst = s[j];
            pr.snd = c[j];
            pr = fun(pr,w);
            s[j] = pr.fst;
            c[j] = pr.snd;
            w = prng(w);
        }
    }
};


int main (int argc, char* argv[]) {
    double acc[VDIM], err[VDIM];
    for (int i = 0; i < VDIM; i++) {
        acc[i] = err[i] = 0.0;
    };
    kahan(acc, err,kahanStep);
    printf("[ ");
    for (int i = 0; i < VDIM; i++) {
        printf("%g ", acc[i]);
    };
    printf("]\n");
};

单独编译并链接，比第一个C版本慢了约25％（0.1秒对0.079秒）。

现在你在C中有一个更高阶的函数，比原来慢得多，但仍然比Haskell代码快得多。一个重要的区别是C函数使用一对未装箱的double s和一个未装箱的double作为参数，而Haskell kahanStep采用一对盒装的Double s和一个盒装的Double并返回一盒装箱的Double，需要在foldV循环中进行昂贵的装箱和拆箱。这可以通过更多内联来解决。使用ghc-7.0.4明确地内联foldV，kahanStep和step会将时间从0.90秒降低到0.74秒（对ghc-7.4.1的影响较小＆## 39;输出，从0.99秒到0.90秒。

但是，拳击和拆箱是差异的一小部分。 foldV比C kahan做的更多，它需要用于修改累加器的向量列表。 C代码中完全没有该向量列表，这会产生很大的不同。所有这些100000个向量必须被分配，填充并放入一个列表中（由于懒惰，并非所有这些都同时存在，所以没有空间问题，但它们以及列表单元格必须被分配和垃圾收集，这需要相当长的时间）。在正确的循环中，不是在寄存器中传递Word#，而是从向量中读取预先计算的值。

如果您使用更直接的C转换为Haskell，

{-# LANGUAGE CPP, BangPatterns #-}
module Main (main) where

#define VDIM 100
#define VNUM 100000

import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits

prng :: Word -> Word
prng w = w'
  where
    !w1 = w `xor` (w `shiftL` 13)
    !w2 = w1 `xor` (w1 `shiftR` 7)
    !w' = w2 `xor` (w2 `shiftL` 17)

type Vec s = STUArray s Int Double

kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
    let inner w j
            | j < VDIM  = do
                !cj <- unsafeRead c j
                !sj <- unsafeRead s j
                let !y = fromIntegral w - cj
                    !t = sj + y
                    !w' = prng w
                unsafeWrite c j ((t-sj)-y)
                unsafeWrite s j t
                inner w' (j+1)
            | otherwise = return ()
    forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0

calc :: ST s (Vec s)
calc = do
    s <- newArray (0,VDIM-1) 0
    c <- newArray (0,VDIM-1) 0
    kahan s c
    return s

main :: IO ()
main = print . elems $ runSTUArray calc

它的速度要快得多。不可否认，它仍然比C慢三倍，但原来这个速度慢了13倍（而且我没有安装llvm，所以我使用vanilla gcc和GHC的原生支持，使用llvm可能给出略有不同的结果）。

我不认为索引真的是罪魁祸首。向量包很大程度上依赖于编译器魔法，但编译分析支持会严重干扰它。对于使用自己的融合框架进行优化的vector或bytestring这样的包，分析干扰可能相当灾难性，并且分析结果完全没用。我倾向于相信我们在这里有这样的情况。

在Core中，所有读取和写入都转换为初始值readDoubleArray#，indexDoubleArray#和writeDoubleArray#，其中快。可能比C阵列访问慢一点，但不是很多。因此，我确信这不是问题，也不是造成巨大差异的原因。但是您已经在其上添加了{-# SCC #-}注释，因此禁用任何涉及重新排列任何这些术语的优化。每次输入其中一个点时，都必须进行记录。我对剖析器和优化器不太熟悉，无法知道究竟发生了什么，但是，作为数据点，{-# INLINE #-}上的foldV pragma，step和{{1这些SCC的分析运行时间为3.17秒，SCC kahanStep，fV_step，fV_read，fV_index和fV_write已删除（没有其他内容）更改）分析运行仅花费2.03s（fV_apply报告的两次，因此减去分析开销）。这种差异表明SCC对廉价函数和过细粒度的SCC可以大大扭曲分析结果。现在，如果我们还在+RTS -P，{-# INLINE #-}和mkVect上添加了kahan个pragma，我们会留下一个完全无法提供信息的配置文件，但运行时间仅为1.23秒。（但是，这些最后的内联对非分析运行没有影响，如果没有分析，它们会自动内联。）

所以，不要将剖析结果视为无可置疑的事实。您的代码（直接或间接通过所使用的库）依赖于优化越多，它就越容易受到由禁用的优化引起的误导性分析结果的影响。这也适用于堆分析以减少空间泄漏，但程度要小得多。

如果您有可疑的性能分析结果，请检查删除某些SCC时会发生什么。如果这导致运行时间大幅下降，则SCC不是您的主要问题（在修复其他问题后可能会再次成为问题）。

查看为您的计划生成的核心，跳出来的是您的prng - 顺便说一句，从中移除kahanStep pragma，它会适得其反 - 产生了循环中装箱的一对盒装{-# NOINLINE #-}，立即解构并且组件未装箱。这种不必要的值的中间装箱是昂贵的，并且大大减慢了计算速度。

今天再次出现haskell-cafe，其中有人用ghc-7.4.1从上面的代码中获得了糟糕的表现，tibbe自己去调查GHC产生的核心并发现GHC为从Double到Word的转换生成了次优代码。仅使用（包装的）原语替换自定义转换的Double转换（并删除不会在这里产生影响的爆炸模式，GHC的严格分析器足以让人看透算法，我应该学会更多地信任它;），我们获得的版本与原始C的fromIntegral输出相同：

gcc -O3

Answer 2

在所有这些看似Data.Vector的代码中，列表组合器中有一个有趣的混合。如果我做出第一个明显的修正，替换

mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng

正确使用Data.Vector.Unboxed：

mkVect = U.force . U.map fromIntegral . U.iterateN 100 prng

然后我的时间减少了三分之二 - 从real 0m1.306s到real 0m0.429s看起来所有顶级函数都有此问题，除了prng和zero

Answer 3

这出现在邮件列表中，我发现GHC 7.4.1中的Word-＆gt; Double转换代码中存在一个错误（至少）。这个版本适用于bug，与我机器上的C代码一样快：

{-# LANGUAGE CPP, BangPatterns, MagicHash #-}
module Main (main) where

#define VDIM 100
#define VNUM 100000

import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import GHC.Word

import GHC.Exts

prng :: Word -> Word
prng w = w'
  where
    w1 = w `xor` (w `shiftL` 13)
    w2 = w1 `xor` (w1 `shiftR` 7)
    w' = w2 `xor` (w2 `shiftL` 17)

type Vec s = STUArray s Int Double

kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
    let inner !w j
            | j < VDIM  = do
                cj <- unsafeRead c j
                sj <- unsafeRead s j
                let y = word2Double w - cj
                    t = sj + y
                    w' = prng w
                unsafeWrite c j ((t-sj)-y)
                unsafeWrite s j t
                inner w' (j+1)
            | otherwise = return ()

        outer i | i <= VNUM = inner (fromIntegral i) 0 >> outer (i + 1)
                | otherwise = return ()
    outer (1 :: Int)

calc :: ST s (Vec s)
calc = do
    s <- newArray (0,VDIM-1) 0
    c <- newArray (0,VDIM-1) 0
    kahan s c
    return s

main :: IO ()
main = print . elems $ runSTUArray calc

{- I originally used this function, which isn't quite correct.
   We need a real bug fix in GHC.
word2Double :: Word -> Double
word2Double (W# w) = D# (int2Double# (word2Int# w))
-}

correction :: Double
correction = 2 * int2Double minBound

word2Double :: Word -> Double
word2Double w = case fromIntegral w of
                   i | i < 0 -> int2Double i - correction
                     | otherwise -> int2Double i

除了解决Word-＆gt; Double bug之外，我还删除了额外的列表以更好地匹配C版本。

Answer 4

我知道你没有要求改进这个微基准的方法，但我会给你一个解释，在将来编写循环时可能会有所帮助：

未知函数调用（例如对foldV的高阶参数进行的调用）在循环中频繁完成时可能很昂贵。特别是，它将禁止取消装箱函数参数，从而增加分配。它禁止参数拆箱的原因是我们不知道我们所调用的函数在这些参数中是严格的，因此我们将参数传递为例如(Double, Double)，而不是Double# -> Double#。

如果循环（例如foldV）满足循环体（例如kahanStep），编译器可以计算出严格性信息。出于这个原因，我建议人INLINE高阶函数。在这种情况下，内联foldV并删除NOINLINE上的kahanStep可以为我提供相当多的运行时间。

在这种情况下，这并没有使C的性能与C相提并论，因为还有其他事情正在发生（正如其他人所评论的那样），但这是朝着正确方向迈出的一步（这是你可以做到的一步）每个人都要查看分析输出。）

Data.Vector.Unboxed.Mutable.MVector的索引真的很慢吗？

4 个答案: