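In a comment on the previous question, I claimed:
I have another benchmark to indicate ghc-7.4.1 + llvm does unpacking of enumerations on strict data fields.
In fact, after some experimentation, I believe that, at least in some simple cases, using an enumeration is at least as fast as, and might actually be more efficient (hence faster in a more realistic application) than, using a newtype of Word8 (or Int), even when it is used in a strict field of a datatype. As I said in the previous question, I have experienced a similar phenomenon in a more realistic (but still small) setting.
Can anyone point me to some relevant references on what optimizations ghc / llvm performs for enumerations? In particular, does it really unpack the internal tag of an enumeration on a strict data field? The assembly output and the profiling results seem to suggest that this is the case, but to me there is no indication of it at the Core level. Any insights will be greatly appreciated.
One more question: are enumerations always at least as efficient as newtypes of the corresponding Integral, wherever using them makes sense? (Note that enumerations can behave like Integrals, too.) If not, what is the (hopefully realistically useful) exception? Daniel Fischer suggested in his answer that putting an enumeration on a strict field of a multiple-constructor datatype might prevent certain optimizations. However, I have failed to verify this in the two-constructor case. Maybe there is a difference when putting them in a large multiple-constructor datatype?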
benchmarking d
mean: 11.09113 ns, lb 11.06140 ns, ub 11.17545 ns, ci 0.950
std dev: 234.6722 ps, lb 72.31532 ps, ub 490.1156 ps, ci 0.950
benchmarking e
mean: 11.54242 ns, lb 11.51789 ns, ub 11.59720 ns, ci 0.950
std dev: 178.8556 ps, lb 73.05290 ps, ub 309.0252 ps, ci 0.950
benchmarking s
mean: 11.74964 ns, lb 11.52543 ns, ub 12.50447 ns, ci 0.950
std dev: 1.803095 ns, lb 207.2720 ps, ub 4.029809 ns, ci 0.950
benchmarking t
mean: 11.89797 ns, lb 11.86266 ns, ub 11.99105 ns, ci 0.950
std dev: 269.5795 ps, lb 81.65093 ps, ub 533.8658 ps, ci 0.950
OK, so the enumeration appears to be at least no less efficient than the newtype.
Next, heap profiles of the function
heapTest x = print $ head $ force $ reverse $ take 100000 $ iterate (force . succ') x
data D = A | B | C:
10,892,604 bytes allocated in the heap
6,401,260 bytes copied during GC
1,396,092 bytes maximum residency (3 sample(s))
55,940 bytes maximum slop
6 MB total memory in use (0 MB lost due to fragmentation)
Productivity 47.8% of total user, 35.4% of total elapsed
newtype E = E Word8:
11,692,768 bytes allocated in the heap
8,909,632 bytes copied during GC
2,779,776 bytes maximum residency (3 sample(s))
92,464 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Productivity 36.9% of total user, 33.8% of total elapsed
data S = S !D:
10,892,736 bytes allocated in the heap
6,401,260 bytes copied during GC
1,396,092 bytes maximum residency (3 sample(s))
55,940 bytes maximum slop
6 MB total memory in use (0 MB lost due to fragmentation)
Productivity 48.7% of total user, 33.3% of total elapsed
data T = T {-# UNPACK #-} !E:
11,692,968 bytes allocated in the heap
8,909,640 bytes copied during GC
2,779,760 bytes maximum residency (3 sample(s))
92,536 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Productivity 36.1% of total user, 31.6% of total elapsed
{-# LANGUAGE CPP, MagicHash, BangPatterns, GeneralizedNewtypeDeriving #-}
module Main(main,d,e,s,t,D(..),E(..),S(..),T(..)) where
import GHC.Base
import Data.List
import Data.Word
import Control.DeepSeq
import Criterion.Main
data D = A | B | C deriving(Eq,Ord,Show,Enum,Bounded)
newtype E = E Word8 deriving(Eq,Ord,Show,Enum)
data S = S !D deriving (Eq,Ord,Show)
data T = T {-# UNPACK #-} !E deriving (Eq,Ord,Show)
-- I assume the following definitions are all correct --- otherwise
-- the whole benchmark may be useless
instance NFData D where
    rnf !x = ()
instance NFData E where
    rnf (E !x) = ()
instance NFData S where
    rnf (S !x) = ()
instance NFData T where
    rnf (T (E !x)) = ()
instance Enum S where
    toEnum = S . toEnum
    fromEnum (S x) = fromEnum x
instance Enum T where
    toEnum = T . toEnum
    fromEnum (T x) = fromEnum x
instance Bounded E where
    minBound = E 0
    maxBound = E 2
instance Bounded S where
    minBound = S minBound
    maxBound = S maxBound
instance Bounded T where
    minBound = T minBound
    maxBound = T maxBound
-- Like succ, but wraps around from maxBound to minBound
succ' :: (Eq a,Enum a,Bounded a) => a -> a
succ' x | x == maxBound = minBound
        | otherwise     = succ x
-- Those numbers below are for easy browsing of the assembly code
d :: D -> Int#
d x = case x of
    A -> 1234#
    B -> 5678#
    C -> 9412#
e :: E -> Int#
e x = case x of
    E 0 -> 1357#
    E 1 -> 2468#
    E _ -> 9914#
s :: S -> Int#
s x = case x of
    S A -> 9876#
    S B -> 5432#
    S C -> 1097#
t :: T -> Int#
t x = case x of
    T (E 0) -> 9630#
    T (E 1) -> 8529#
    T (E _) -> 7418#
benchmark :: IO ()
benchmark = defaultMain [ bench "d" $ whnf d' A
                        , bench "e" $ whnf e' (E 0)
                        , bench "s" $ whnf s' (S A)
                        , bench "t" $ whnf t' (T (E 0))
                        ]
-- Boxed wrappers, since whnf cannot return an unboxed Int#
d' :: D -> Int
d' x = I# (d x)
e' :: E -> Int
e' x = I# (e x)
s' :: S -> Int
s' x = I# (s x)
t' :: T -> Int
t' x = I# (t x)
heapTest :: (NFData a,Show a,Eq a,Enum a,Bounded a) => a -> IO ()
heapTest x = print $ head $ force $ reverse $ take 100000 $ iterate (force . succ') x
main :: IO ()
main =
#if defined TEST_D
    heapTest (A :: D)
#elif defined TEST_E
    heapTest (E 0 :: E)
#elif defined TEST_S
    heapTest (S A :: S)
#elif defined TEST_T
    heapTest (T (E 0) :: T)
#else
    -- No TEST_* flag given: run the Criterion microbenchmarks
    benchmark
#endif
-- A minor rant:
-- For reliable statistics, I hope Criterion will run the code in *random order*,
-- at least for comparing functions with the same type. Elapsed times on my system are just too
-- noisy to conclude anything.
# If you don't like the AT&T syntax in the output assembly, use this: -fllvm -optlc --x86-asm-syntax=intel
# GHC and GHC_MAKE are used by the rules below but are not defined in the
# Makefile as posted; the two definitions here are assumed.
GHC=ghc
GHC_DEBUG_FLAGS= -keep-s-file -keep-llvm-file # -optlc --x86-asm-syntax=intel
GHCFLAGS=-O2 -funbox-strict-fields -rtsopts -fllvm -fwarn-missing-signatures
GHC_MAKE=$(GHC) --make $(GHCFLAGS)
GHC_PROF_MAKE=$(GHC) -prof -auto-all -caf-all --make $(GHCFLAGS)

all : benchmark enumtest_all

enumtest_d : EnumTest.hs
	$(GHC_MAKE) -o $@ $^ -DTEST_D

enumtest_e : EnumTest.hs
	$(GHC_MAKE) -o $@ $^ -DTEST_E

enumtest_s : EnumTest.hs
	$(GHC_MAKE) -o $@ $^ -DTEST_S

enumtest_t : EnumTest.hs
	$(GHC_MAKE) -o $@ $^ -DTEST_T

enumtest_all : enumtest_d enumtest_e enumtest_s enumtest_t
	for x in $^; do ./$$x +RTS -sstderr ;done

benchmark : EnumTest
	time ./$^

% : %.hs
	$(GHC_MAKE) -o $@ $^

%.core : %.hs
	$(GHC) -S $(GHCFLAGS) $(GHC_DEBUG_FLAGS) -ddump-simpl -dsuppress-all -dsuppress-coercions -ddump-stranal $^ > $@

clean :
	rm *.hi *.o *.core *.s enumtest_? ; true
Answer 0 (score: 8)
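Regarding "Daniel Fischer suggested in his answer that putting an enumeration on a strict field of a multiple-constructor datatype might prevent certain optimizations": you misunderstood that. If you have a constructor C of a type with a strict field of type T, that is C ... !T ..., it should in principle be possible to unpack the constructor tag of the value of type T, but GHC just doesn't do that (there is probably a reason for it; it may be more complicated than I see). Nevertheless, for small enough enumeration types, the pointer tagging mentioned by Mikhail Gushenkov should have more or less the same effect (perhaps not quite the same).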
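To make the distinction concrete, here is a minimal sketch that reuses the question's four types. The comments describe the behaviour this answer attributes to GHC at the time (an enumeration in a strict field stays a pointer, a newtype around Word8 can be unpacked); they are not verified against any particular compiler version, and the main function is only there to make the file runnable.

```haskell
module Main (main) where

import Data.Word (Word8)

-- Three-constructor enumeration, as in the question.
data D = A | B | C deriving (Show, Eq)

-- Newtype around an integral type, as in the question.
newtype E = E Word8 deriving (Show, Eq)

-- Strict enumeration field: per the answer, GHC does not unpack the
-- constructor tag of D into S, so the field is a pointer to the
-- closure for A, B or C.
data S = S !D deriving (Show, Eq)

-- Strict newtype field with UNPACK: the Word8 payload can be stored
-- directly inside the T constructor.
data T = T {-# UNPACK #-} !E deriving (Show, Eq)

main :: IO ()
main = do
  print (S B)
  print (T (E 1))
```

Compiling with -ddump-simpl, as the question's Makefile does for the .core target, is one way to check how a given GHC actually represents the two fields.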
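Whether converting an enumeration into a newtype around an integral type would actually improve performance depends on what you do with the values. It may also make your program slower, and it may use more memory (see below).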
There is no general rule; each case needs to be assessed on its own. There are patterns where a newtype-wrapped Int [...] and E 2 instead of E 0:
warming up
estimating clock resolution...
mean is 1.549612 us (640001 iterations)
found 4506 outliers among 639999 samples (0.7%)
3639 (0.6%) high severe
estimating cost of a clock call...
mean is 39.24624 ns (12 iterations)
found 2 outliers among 12 samples (16.7%)
1 (8.3%) low mild
1 (8.3%) high severe
benchmarking d
mean: 12.12989 ns, lb 12.01136 ns, ub 12.32002 ns, ci 0.950
std dev: 755.9999 ps, lb 529.5348 ps, ub 1.034185 ns, ci 0.950
found 17 outliers among 100 samples (17.0%)
17 (17.0%) high severe
variance introduced by outliers: 59.503%
variance is severely inflated by outliers
benchmarking e
mean: 10.82692 ns, lb 10.73286 ns, ub 10.98045 ns, ci 0.950
std dev: 604.1786 ps, lb 416.5018 ps, ub 871.0923 ps, ci 0.950
found 10 outliers among 100 samples (10.0%)
4 (4.0%) high mild
6 (6.0%) high severe
variance introduced by outliers: 53.482%
variance is severely inflated by outliers
benchmarking s
mean: 13.18192 ns, lb 13.11898 ns, ub 13.25911 ns, ci 0.950
std dev: 354.1332 ps, lb 300.2860 ps, ub 406.2424 ps, ci 0.950
found 13 outliers among 100 samples (13.0%)
13 (13.0%) high mild
variance introduced by outliers: 20.952%
variance is moderately inflated by outliers
benchmarking t
mean: 11.16461 ns, lb 11.02716 ns, ub 11.37018 ns, ci 0.950
std dev: 853.2152 ps, lb 602.5197 ps, ub 1.086899 ns, ci 0.950
found 14 outliers among 100 samples (14.0%)
3 (3.0%) high mild
11 (11.0%) high severe
variance introduced by outliers: 68.689%
variance is severely inflated by outliers
[...] resp. E 1
heapTest x = print $ head $ force $ reverse $ take 100000 $ iterate (force . succ') x
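Unfortunately, iterate (force . succ') does not force the list elements as the list is constructed, so you get a list of thunks (of increasing depth), reverse an initial segment of it, and only then force the list elements. A version of iterate that forces each element as the list is built avoids this: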
iterate' :: (a -> a) -> a -> [a]
iterate' f !a = a : iterate' f (f a)
(A bang pattern, i.e. evaluation to WHNF, is enough to completely evaluate values of the types in question.)
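The difference between the two can be tried out with a small self-contained program. This is only a sketch: it copies D and succ' from the question and iterate' from above, and picks index 100000 to match the question's list length. Both calls compute the same element; with plain iterate that element is a chain of 100,000 nested succ' thunks by the time it is demanded, whereas iterate' evaluates each element when its list cell is constructed.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

data D = A | B | C deriving (Show, Eq, Enum, Bounded)

-- Wrap-around successor, as in the question.
succ' :: (Eq a, Enum a, Bounded a) => a -> a
succ' x
  | x == maxBound = minBound
  | otherwise     = succ x

-- Strict variant of iterate: the bang pattern forces each element to
-- WHNF before the list cell holding it is returned.
iterate' :: (a -> a) -> a -> [a]
iterate' f !a = a : iterate' f (f a)

main :: IO ()
main = do
  -- Lazy version: the selected element is a deeply nested thunk.
  print (iterate succ' A !! 100000)
  -- Strict version: every element is evaluated as the list is walked.
  print (iterate' succ' A !! 100000)
```

Running each variant with +RTS -sstderr, as the question's Makefile does, shows how the allocation and residency figures differ.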
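[...] then one simply needs list !! index.
The difference in allocation between the enumeration and the newtype variants in your figures is almost exactly 800,000 bytes (11,692,768 versus 10,892,604). That is what 200,000 words need, i.e. what allocating 100,000 boxed Word8 (or Int) values requires: per value, one word for the constructor and one for the payload.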
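The arithmetic can be checked directly against the figures quoted in the question. In this sketch the byte counts are taken from the question's +RTS -sstderr output, while the 4-byte word size is an assumption inferred from 800,000 bytes corresponding to 200,000 words; the names are made up for illustration.

```haskell
module Main (main) where

-- "bytes allocated in the heap" from the question's output.
allocEnum, allocNewtype :: Int
allocEnum    = 10892604   -- data D = A | B | C
allocNewtype = 11692768   -- newtype E = E Word8

-- Assumption: 32-bit system, so one machine word is 4 bytes.
wordSize :: Int
wordSize = 4

-- A boxed Word8 or Int occupies two words: constructor header + payload.
wordsPerBoxedValue :: Int
wordsPerBoxedValue = 2

listLength :: Int
listLength = 100000

main :: IO ()
main = do
  -- Measured difference between the newtype and enumeration runs.
  print (allocNewtype - allocEnum)
  -- Predicted cost of boxing one value per list element.
  print (listLength * wordsPerBoxedValue * wordSize)
```

The measured difference (800,164 bytes) and the predicted one (800,000 bytes) agree to within a few hundred bytes.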
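Answer 1 (score: 7)
Without looking at the compiler output, I think that the absence of a speed increase in the newtype version of your code may be due to pointer tagging. On x86, GHC reserves 2 bits in each pointer for information about the pointed-to closure. 00 means "unevaluated or unknown", and the other 3 cases encode the actual tag of the evaluated constructor. This information is dynamically updated by the garbage collector. Since you have only 3 cases in your test data type, they always fit in the tag bits, so pattern matching never requires an indirection. Try adding more cases to your data type and see what happens. You can find more information about dynamic pointer tagging in this paper: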
Faster laziness using dynamic pointer tagging
Simon Marlow, Alexey Rodriguez Yakushev, and Simon Peyton Jones, ICFP 2007.
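One way to follow the suggestion of adding more cases is sketched below. The type D8, its constructors and the function d8 are hypothetical names, not part of the question's code. With 8 constructors the type has more alternatives than the 3 non-zero tag values described above, so it could be dropped into the question's Criterion benchmark next to d to see whether the timings change.

```haskell
module Main (main) where

-- Hypothetical enumeration with 8 constructors, more than the three
-- constructor tags that fit into 2 pointer-tag bits.
data D8 = C0 | C1 | C2 | C3 | C4 | C5 | C6 | C7
  deriving (Show, Eq, Ord, Enum, Bounded)

-- Same shape as the question's d: one distinct constant per
-- constructor, so the branches are easy to spot in the assembly.
d8 :: D8 -> Int
d8 x = case x of
  C0 -> 1000
  C1 -> 1001
  C2 -> 1002
  C3 -> 1003
  C4 -> 1004
  C5 -> 1005
  C6 -> 1006
  C7 -> 1007

main :: IO ()
main = mapM_ (print . d8) [minBound .. maxBound]
```

For a fair comparison with the existing benchmarks, d8 would be measured with whnf in the same defaultMain list as d, e, s and t.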