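I have a very simple piece of code in both Haskell and Scala. The code is meant to run in a very tight loop, so performance matters. The problem is that the Haskell version is about 10 times slower than the Scala one. Here is the Haskell code.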
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as VU
newtype AffineTransform = AffineTransform {get :: (VU.Vector Double)} deriving (Show)
{-# INLINE runAffineTransform #-}
runAffineTransform :: AffineTransform -> (Double, Double) -> (Double, Double)
runAffineTransform affTr (!x, !y) =
  ( get affTr `VU.unsafeIndex` 0 * x + get affTr `VU.unsafeIndex` 1 * y + get affTr `VU.unsafeIndex` 2
  , get affTr `VU.unsafeIndex` 3 * x + get affTr `VU.unsafeIndex` 4 * y + get affTr `VU.unsafeIndex` 5 )
testAffineTransformSpeed :: AffineTransform -> Int -> (Double, Double)
testAffineTransformSpeed affTr count = go count (0.5, 0.5)
  where go :: Int -> (Double, Double) -> (Double, Double)
        go 0 res = res
        go !n !res = go (n-1) (runAffineTransform affTr res)
What else can be done to improve this code?
Answer 0 (score: 9)
I defined the following strict/unboxed pair type:
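import qualified Data.Vector.Unboxed as U -- the code below uses the alias U
import Control.Monad (replicateM)         -- used in main
import Criterion.Main                     -- bench, nf, defaultMainWith
import Criterion.Config                   -- defaultConfig, cfgPerformGC, ljust (criterion < 1.0)
import System.Random.MWC -- for later
import Control.DeepSeq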
data SP = SP {
    one :: {-# UNPACK #-} !Double
  , two :: {-# UNPACK #-} !Double
  } deriving Show
instance NFData SP where
  rnf p = rnf (one p) `seq` rnf (two p) `seq` ()
and replaced the runAffineTransform function with:
runAffineTransform2 :: AffineTransform -> SP -> SP
runAffineTransform2 affTr !(SP x y) =
  SP (  get affTr `U.unsafeIndex` 0 * x
      + get affTr `U.unsafeIndex` 1 * y
      + get affTr `U.unsafeIndex` 2 )
     (  get affTr `U.unsafeIndex` 3 * x
      + get affTr `U.unsafeIndex` 4 * y
      + get affTr `U.unsafeIndex` 5 )
{-# INLINE runAffineTransform2 #-}
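The loop that drives runAffineTransform2 (called testAffineTransformSpeed2 in the benchmark) is not shown in this answer; it lives in the linked gist. A minimal sketch of what it plausibly looks like, assuming it mirrors the question's go loop with SP in place of the tuple, might be the following. The helper c and the starting point SP 0.5 0.5 are assumptions made for illustration, not taken from the gist.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as U

newtype AffineTransform = AffineTransform { get :: U.Vector Double }

-- Strict pair with unpacked fields, as in the answer (field names omitted).
data SP = SP {-# UNPACK #-} !Double {-# UNPACK #-} !Double
  deriving (Show, Eq)

runAffineTransform2 :: AffineTransform -> SP -> SP
runAffineTransform2 affTr (SP x y) =
  SP (c 0 * x + c 1 * y + c 2)
     (c 3 * x + c 4 * y + c 5)
  where
    -- Hypothetical shorthand for reading one coefficient without bounds checks.
    c i = get affTr `U.unsafeIndex` i
{-# INLINE runAffineTransform2 #-}

-- Assumed shape of the loop: the question's `go`, but carrying an SP.
-- Because SP's fields are strict, each step's result is fully computed
-- as soon as the SP is forced by the bang pattern.
testAffineTransformSpeed2 :: AffineTransform -> Int -> SP
testAffineTransformSpeed2 affTr count = go count (SP 0.5 0.5)
  where
    go :: Int -> SP -> SP
    go 0 !res = res
    go n !res = go (n - 1) (runAffineTransform2 affTr res)
```

With the six coefficients [1, 0, 1, 0, 1, 2] (a pure translation by (1, 2)), three iterations from (0.5, 0.5) give SP 3.5 6.5, which is an easy sanity check before benchmarking.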
Then I ran this benchmark suite:
main :: IO ()
main = do
  g <- create
  zs <- fmap (AffineTransform . U.fromList)
             (replicateM 100000 (uniformR (0 :: Double, 1) g))
  let myConfig = defaultConfig { cfgPerformGC = ljust True }
  defaultMainWith myConfig (return ()) [
      bench "yours" $ nf (testAffineTransformSpeed zs) 10
    , bench "mine" $ nf (testAffineTransformSpeed2 zs) 10
    ]
I compiled it with -O2, ran it, and observed a speedup of roughly 4x:
benchmarking yours
mean: 257.4559 ns, lb 256.2492 ns, ub 258.9761 ns, ci 0.950
std dev: 6.889905 ns, lb 5.688330 ns, ub 8.839753 ns, ci 0.950
found 5 outliers among 100 samples (5.0%)
3 (3.0%) high mild
2 (2.0%) high severe
variance introduced by outliers: 20.944%
variance is moderately inflated by outliers
benchmarking mine
mean: 69.56408 ns, lb 69.29910 ns, ub 69.86838 ns, ci 0.950
std dev: 1.448874 ns, lb 1.261444 ns, ub 1.718074 ns, ci 0.950
found 4 outliers among 100 samples (4.0%)
4 (4.0%) high mild
variance introduced by outliers: 14.190%
variance is moderately inflated by outliers
The full code is in a gist here.
Edit:
I have also posted criterion's output report here.
Answer 1 (score: 8)
The main problem is that
runAffineTransform affTr (!x, !y) = (  get affTr `VU.unsafeIndex` 0 * x
                                     + get affTr `VU.unsafeIndex` 1 * y
                                     + get affTr `VU.unsafeIndex` 2,
                                       get affTr `VU.unsafeIndex` 3 * x
                                     + get affTr `VU.unsafeIndex` 4 * y
                                     + get affTr `VU.unsafeIndex` 5)
produces a pair of thunks. The components are not evaluated when runAffineTransform is called; they remain unevaluated until some consumer demands their values.
testAffineTransformSpeed affTr count = go count (0.5, 0.5)
  where go :: Int -> (Double, Double) -> (Double, Double)
        go 0 res = res
        go !n !res = go (n-1) (runAffineTransform affTr res)
is not that consumer: the bang on res only evaluates it to the outermost constructor, (,), and you get a result of
runAffineTransform affTr (runAffineTransform affTr (runAffineTransform affTr (...)))
which is only evaluated at the very end, when the normal form is finally demanded.
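To make the point about weak head normal form concrete, here is a small sketch that uses only base. The step function is hypothetical: its second component is a thunk that would raise an error if anything evaluated it. The loop has the same shape as go above, so if the bang on res forced the components, the loop could not finish.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical step function: the second component is a thunk that would
-- raise an error if it were ever evaluated.
step :: (Double, Double) -> (Double, Double)
step (x, _) = (x + 1, error "second component was forced")

-- Same shape as `go`: the bang on `res` forces only the outermost (,)
-- constructor, never the components inside it.
loop :: Int -> (Double, Double) -> (Double, Double)
loop 0 res = res
loop !n !res = loop (n - 1) (step res)
```

Evaluating fst (loop 1000 (0, 0)) completes normally, which shows the loop never touched the second component; asking for snd of the same result raises the error. The first component is likewise only computed when fst demands it, by unwinding the chain of suspended x + 1 additions built up during the loop.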
If you force the components of the result to be evaluated immediately,
runAffineTransform affTr (!x, !y) = case
  (   get affTr `U.unsafeIndex` 0 * x
    + get affTr `U.unsafeIndex` 1 * y
    + get affTr `U.unsafeIndex` 2
  ,   get affTr `U.unsafeIndex` 3 * x
    + get affTr `U.unsafeIndex` 4 * y
    + get affTr `U.unsafeIndex` 5
  ) of (!a,!b) -> (a,b)
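and let it be inlined, then the main difference from the version that uses a custom strict pair of unboxed Double# values is that the loop in testAffineTransformSpeed gets one initial iteration that takes boxed Doubles as arguments, and at the end the components of the result are boxed, which adds some constant overhead (about 5 nanoseconds per loop on my box). In both cases the main part of the loop takes an Int# and two Double# arguments, and apart from the boxing when n = 0 is reached, the loop bodies are identical.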
Of course, it is nicer to force the immediate evaluation of the components by using an unboxed strict pair type.
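Both ways of forcing the components can be checked with a small base-only sketch. The helpers forcePair and throwsOnWhnf are hypothetical names introduced here for illustration; SP mirrors the strict pair from the other answer. An undefined component stands in for an expensive computation, so an exception means "this component was evaluated".

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Exception (SomeException, evaluate, try)

-- Strict pair with unpacked fields, mirroring SP from the other answer.
data SP = SP {-# UNPACK #-} !Double {-# UNPACK #-} !Double

-- The case/bang-pattern trick from above as a standalone helper:
-- both components are evaluated before the new pair is returned.
forcePair :: (Double, Double) -> (Double, Double)
forcePair p = case p of (!a, !b) -> (a, b)

-- Reports whether evaluating a value to weak head normal form throws.
throwsOnWhnf :: a -> IO Bool
throwsOnWhnf v = either (const True) (const False) <$> tryAny (evaluate v)
  where
    tryAny :: IO b -> IO (Either SomeException b)
    tryAny = try
```

Under these definitions, throwsOnWhnf (undefined, 1 :: Double) yields False because a plain tuple leaves its components alone, while throwsOnWhnf (forcePair (undefined, 1)) and throwsOnWhnf (SP undefined 1) both yield True. The strict-field type has the advantage that the forcing is a property of the type, so it cannot be forgotten at an individual call site.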