Question

I have implemented a simple L1 distance calculator in Haskell. Since I am interested in performance, I use unboxed vectors to store the images to compare.

calculateL1Distance :: LabeledImage -> LabeledImage -> Int
calculateL1Distance reference test = 
            let
              substractPixels :: Int -> Int -> Int
              substractPixels a b = abs $ a - b
              diff f = Vec.sum $ Vec.zipWith substractPixels (f reference) (f test)
            in
              diff pixels
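
As a small worked example of what this function computes, here is a minimal sketch on bare vectors. The name l1Distance and the sample values are made up for illustration and are not part of the program in question; it only assumes the vector package.

```haskell
import qualified Data.Vector.Unboxed as Vec

-- L1 (Manhattan) distance: sum of absolute element-wise differences.
-- Same zipWith/sum formulation as calculateL1Distance, minus the record wrapper.
l1Distance :: Vec.Vector Int -> Vec.Vector Int -> Int
l1Distance xs ys = Vec.sum (Vec.zipWith (\a b -> abs (a - b)) xs ys)

main :: IO ()
main = do
  let reference = Vec.fromList [10, 200, 30]
      test      = Vec.fromList [12, 190, 30]
  -- |10-12| + |200-190| + |30-30| = 2 + 10 + 0
  print (l1Distance reference test)  -- prints 12
```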

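As far as I know (I am new to Haskell), stream fusion should make calculateL1Distance run as a simple loop, so it should be fast. However, performance is poor when the program is compiled with: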

ghc -O -fforce-recomp -rtsopts -o test .\performance.hs

The program takes about 60 seconds:

 198,871,911,896 bytes allocated in the heap
   1,804,017,536 bytes copied during GC
     254,900,000 bytes maximum residency (14 sample(s))
       9,020,888 bytes maximum slop
             579 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     378010 colls,     0 par    2.312s   2.949s     0.0000s    0.0063s
  Gen  1        14 colls,     0 par    0.562s   0.755s     0.0539s    0.2118s

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time   58.297s  ( 64.380s elapsed)
  GC      time    2.875s  (  3.704s elapsed)
  EXIT    time    0.016s  (  0.088s elapsed)
  Total   time   61.188s  ( 68.176s elapsed)

  %GC     time       4.7%  (5.4% elapsed)

  Alloc rate    3,411,364,878 bytes per MUT second

  Productivity  95.3% of total user, 94.6% of total elapsed

However, performance improves dramatically when the program is compiled with:

ghc -O2 -fforce-recomp -rtsopts -o test .\performance.hs

The run time drops to about 13 seconds:

   2,261,672,056 bytes allocated in the heap
   1,571,668,904 bytes copied during GC
     241,064,192 bytes maximum residency (12 sample(s))
       8,839,048 bytes maximum slop
             544 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2951 colls,     0 par    1.828s   1.927s     0.0007s    0.0059s
  Gen  1        12 colls,     0 par    0.516s   0.688s     0.0573s    0.2019s

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time   10.484s  ( 16.598s elapsed)
  GC      time    2.344s  (  2.615s elapsed)
  EXIT    time    0.000s  (  0.105s elapsed)
  Total   time   12.828s  ( 19.324s elapsed)

  %GC     time      18.3%  (13.5% elapsed)

  Alloc rate    215,718,348 bytes per MUT second

  Productivity  81.7% of total user, 86.4% of total elapsed

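The effect becomes even stronger when a larger part of the image set is used, because image loading then accounts for a smaller share of the run time. According to the HaskellWiki (https://wiki.haskell.org/Performance/GHC), there is practically no difference between -O and -O2. However, I observe a huge effect, so I am wondering whether I am missing something. Is there an optimization I should apply to the code myself that the compiler (GHC) performs when compiling with -O2? If so, what does it do? From what I have read, the main performance improvement comes from stream fusion, and my function looks like one where stream fusion can be applied.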
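
For comparison, the "simple loop" that fusion is expected to produce can be written by hand. The following is only a sketch of such a loop, not what GHC is known to generate for this program; the name l1Loop is hypothetical. It makes a single pass with a strict accumulator and allocates no intermediate vector.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as Vec

-- Hand-written counterpart of
--   Vec.sum (Vec.zipWith (\a b -> abs (a - b)) xs ys)
l1Loop :: Vec.Vector Int -> Vec.Vector Int -> Int
l1Loop xs ys = go 0 0
  where
    -- zipWith stops at the shorter input, so the loop does the same.
    n = min (Vec.length xs) (Vec.length ys)
    go :: Int -> Int -> Int
    go !acc !i
      | i >= n    = acc
      -- unsafeIndex skips bounds checks; safe here because i < n.
      | otherwise = go (acc + abs (Vec.unsafeIndex xs i - Vec.unsafeIndex ys i)) (i + 1)

main :: IO ()
main = do
  let xs = Vec.enumFromN 0 3072     -- 0, 1, ..., 3071
      ys = Vec.replicate 3072 100
  -- The loop and the zipWith/sum formulation should agree.
  print (l1Loop xs ys == Vec.sum (Vec.zipWith (\a b -> abs (a - b)) xs ys))
```

One way to check whether the zipWith/sum version really compiles down to something like this is to inspect the generated Core with -ddump-simpl at both optimization levels. A possible explanation worth testing, though not confirmed here, is that -O2 additionally enables -fspec-constr and -fliberate-case, which the GHC User's Guide lists as off at -O.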

For reference, here is the complete example of my test program.

import Data.List
import Data.Word
import qualified Data.ByteString as ByteStr
import qualified Data.ByteString.Char8 as ByteStrCh8
import qualified Data.Vector.Unboxed as Vec

data LabeledImage = LabeledImage {
       labelIdx :: Int
     , pixels :: Vec.Vector Int
} deriving (Eq)

extractLabeledImages :: ByteStr.ByteString -> [LabeledImage] -> [LabeledImage]
extractLabeledImages source images
      | ByteStr.length source >= imgLength =
                    let
                      (label,trailData) = ByteStr.splitAt labelBytes source
                      (rgbData,remainingData) = ByteStr.splitAt colorBytes trailData
                      numLabel = fromIntegral (ByteStr.head label)
                      pixelValues = Vec.generate (ByteStr.length rgbData) (fromIntegral . ByteStr.index rgbData)
                    in
                      extractLabeledImages remainingData (images ++ [LabeledImage numLabel pixelValues])
      | otherwise = images
      where
        labelBytes = 1
        colorBytes = 3072
        imgLength = labelBytes + colorBytes

calculateL1Distance :: LabeledImage -> LabeledImage -> Int
calculateL1Distance reference test = 
            let
              substractPixels :: Int -> Int -> Int
              substractPixels a b = abs $ a - b
              diff f = Vec.sum $ Vec.zipWith substractPixels (f reference) (f test)
            in
              diff pixels

main = do
       batch1Raw <- ByteStr.readFile "M:\\Documents\\StanfordCNN\\cifar10\\data_batch_1.bin"
       testBatchRaw <- ByteStr.readFile "M:\\Documents\\StanfordCNN\\cifar10\\test_batch.bin"

       let referenceImages = take 1000 $ extractLabeledImages batch1Raw []
       let testImages = take 1000 $ extractLabeledImages testBatchRaw []

       putStrLn "Created image sets. Starting tests."
       let results = [calculateL1Distance referenceImage testImage | referenceImage <- referenceImages, testImage <- testImages ]
       ByteStr.writeFile "M:\\Documents\\StanfordCNN\\results.txt" (ByteStrCh8.pack $ show results)
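
A side note on the loading code: extractLabeledImages appends each image to an accumulator with (++), which copies the accumulated list on every step, and take 1000 cannot stop the extraction early because the accumulator is only returned at the end. The sketch below shows one possible alternative that yields records lazily with unfoldr. The name extractLazily and the synthetic input are hypothetical; the record layout (1 label byte followed by 3072 pixel bytes) is taken from the program above. Whether this matters for the measured times is not established here.

```haskell
import Data.List (unfoldr)
import qualified Data.ByteString as ByteStr
import qualified Data.Vector.Unboxed as Vec

data LabeledImage = LabeledImage
  { labelIdx :: Int
  , pixels   :: Vec.Vector Int
  } deriving (Eq, Show)

-- Hypothetical alternative to extractLabeledImages: emit records front to
-- back, so that `take n` only forces the first n records.
extractLazily :: ByteStr.ByteString -> [LabeledImage]
extractLazily = unfoldr step
  where
    labelBytes = 1
    colorBytes = 3072
    imgLength  = labelBytes + colorBytes
    step source
      | ByteStr.length source < imgLength = Nothing
      | otherwise =
          let (record, rest) = ByteStr.splitAt imgLength source
              rgbData        = ByteStr.drop labelBytes record
              image = LabeledImage
                        (fromIntegral (ByteStr.head record))
                        (Vec.generate colorBytes (fromIntegral . ByteStr.index rgbData))
          in Just (image, rest)

main :: IO ()
main = do
  -- Two synthetic records: label 7 with all-zero pixels, label 9 with all-255 pixels.
  let raw = ByteStr.concat
              [ ByteStr.cons 7 (ByteStr.replicate 3072 0)
              , ByteStr.cons 9 (ByteStr.replicate 3072 255) ]
      images = extractLazily raw
  print (map labelIdx images)            -- prints [7,9]
  print (map (Vec.sum . pixels) images)  -- prints [0,783360]
```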

Why does -O2 have such a big impact on a simple L1 distance calculator in Haskell?

0 answers: