How to optimize a loop that can be fully strict

Date: 2012-11-06 14:21:55

Tags: performance haskell micro-optimization

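I'm trying to write a brute-force solution to Project Euler Problem #145, and I can't get it to run in less than about 1 minute 30 seconds.

(I know there are various shortcuts and even pencil-and-paper solutions; for the purposes of this question I'm not considering those.)

In the best version I've come up with so far, profiling shows that most of the time is spent in foldDigits. This function doesn't need to be lazy at all, and to my mind it ought to be optimized into a simple loop. As you can see, I've tried to make various parts of the program strict.

So my question is: without changing the overall algorithm, is there some way to bring the execution time of this program down to under a minute?

(Or, if not, is there a way to check that the code for foldDigits is as optimized as it can be?)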

-- ghc -O3 -threaded Euler-145.hs && Euler-145.exe +RTS -N4

{-# LANGUAGE BangPatterns #-}

import Control.Parallel.Strategies

foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f !acc !n
    | n < 10    = i
    | otherwise = foldDigits f i d
  where (d, m) = n `quotRem` 10
        !i     = f acc m

reverseNumber :: Int -> Int
reverseNumber !n
    = foldDigits accumulate 0 n
  where accumulate !v !d = v * 10 + d

allDigitsOdd :: Int -> Bool
allDigitsOdd n
    = foldDigits andOdd True n
  where andOdd !a d = a && isOdd d
        isOdd !x    = x `rem` 2 /= 0

isReversible :: Int -> Bool
isReversible n
    = notDivisibleByTen n && allDigitsOdd (n + rn)
  where rn                   = reverseNumber n
        notDivisibleByTen !x = x `rem` 10 /= 0

countRange acc start end
    | start > end = acc
    | otherwise   = countRange (acc + v) (start + 1) end
  where v = if isReversible start then 1 else 0

main
    = print $ sum $ parMap rseq cr ranges
  where max       = 1000000000
        qmax      = max `div` 4
        ranges    = [(1, qmax), (qmax, qmax * 2), (qmax * 2, qmax * 3), (qmax * 3, max)]
        cr (s, e) = countRange 0 s e

1 answer:

Answer 0 (score: 8)

As it stands, the Core that ghc-7.6.1 produces for foldDigits (with -O2) is

Rec {
$wfoldDigits_r2cK
  :: forall a_aha.
     (a_aha -> GHC.Types.Int -> a_aha)
     -> a_aha -> GHC.Prim.Int# -> a_aha
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType C(C(S))SL]
$wfoldDigits_r2cK =
  \ (@ a_aha)
    (w_s284 :: a_aha -> GHC.Types.Int -> a_aha)
    (w1_s285 :: a_aha)
    (ww_s288 :: GHC.Prim.Int#) ->
    case w1_s285 of acc_Xhi { __DEFAULT ->
    let {
      ds_sNo [Dmd=Just D(D(T)S)] :: (GHC.Types.Int, GHC.Types.Int)
      [LclId, Str=DmdType]
      ds_sNo =
        case GHC.Prim.quotRemInt# ww_s288 10
        of _ { (# ipv_aJA, ipv1_aJB #) ->
        (GHC.Types.I# ipv_aJA, GHC.Types.I# ipv1_aJB)
        } } in
    case w_s284 acc_Xhi (case ds_sNo of _ { (d_arS, m_Xsi) -> m_Xsi })
    of i_ahg { __DEFAULT ->
    case GHC.Prim.<# ww_s288 10 of _ {
      GHC.Types.False ->
        case ds_sNo of _ { (d_Xsi, m_Xs5) ->
        case d_Xsi of _ { GHC.Types.I# ww1_X28L ->
        $wfoldDigits_r2cK @ a_aha w_s284 i_ahg ww1_X28L
        }
        };
      GHC.Types.True -> i_ahg
    }
    }
    }
end Rec }

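which, as you can see, re-boxes the results of the quotRem call. The problem is that nothing is known about f here, and, being a recursive function, foldDigits cannot be inlined.

With a manual worker-wrapper transform that makes the function argument static,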
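To reproduce a dump like the one above, one option is to ask GHC for its simplifier output. The sketch below wraps the question's foldDigits in a small module. The OPTIONS_GHC line, the export list and the `main` are assumptions added for illustration, not part of the original post, and -dsuppress-all hides most type and identifier annotations, so the output will look terser than the listing above.

```haskell
{-# LANGUAGE BangPatterns #-}
-- Assumption: these flags are added for illustration only.
--   -ddump-simpl     emit the optimized Core
--   -dsuppress-all   hide most annotations to keep the dump readable
--   -ddump-to-file   write the dump to a .dump-simpl file, not the terminal
{-# OPTIONS_GHC -O2 -ddump-simpl -dsuppress-all -ddump-to-file #-}

-- foldDigits is exported so that GHC keeps the general version around.
module Main (main, foldDigits) where

-- The question's original definition, unchanged.
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f !acc !n
    | n < 10    = i
    | otherwise = foldDigits f i d
  where (d, m) = n `quotRem` 10
        !i     = f acc m

main :: IO ()
main = print (foldDigits (+) 0 12345)  -- digit sum of 12345
```

In the resulting dump, `I#` constructor applications inside the recursive worker are the re-boxing in question. The exact Core depends on the GHC version, so it will not necessarily match the listing above line for line.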

foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f = go
  where
    go !acc 0 = acc
    go acc n = case n `quotRem` 10 of
                 (q,r) -> go (f acc r) q

foldDigits becomes inlinable, and you get specialized versions for each of your uses that operate on unboxed data, but no top-level foldDigits, e.g.

Rec {
$wgo_r2di :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
$wgo_r2di =
  \ (ww_s28F :: GHC.Prim.Int#) (ww1_s28J :: GHC.Prim.Int#) ->
    case ww1_s28J of ds_XJh {
      __DEFAULT ->
        case GHC.Prim.quotRemInt# ds_XJh 10
        of _ { (# ipv_aJK, ipv1_aJL #) ->
        $wgo_r2di (GHC.Prim.+# (GHC.Prim.*# ww_s28F 10) ipv1_aJL) ipv_aJK
        };
      0 -> ww_s28F
    }
end Rec }
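Put together, the transformed function and its callers from the question might look like the sketch below. The lambda-style callers and the small `main` check are additions for illustration. The check relies on the figure given in the Project Euler problem statement: 120 reversible numbers below one thousand.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

-- Wrapper: not recursive itself, so GHC is free to inline it at each call
-- site, where the concrete f is known.
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f = go
  where
    -- Worker: the only recursive part. f is a free variable here and is
    -- never passed around the loop.
    -- Note: for n = 0 this returns the initial accumulator, whereas the
    -- question's version applies f once. All inputs used below are positive.
    go !acc 0 = acc
    go !acc n = case n `quotRem` 10 of
                  (q, r) -> go (f acc r) q

reverseNumber :: Int -> Int
reverseNumber = foldDigits (\v d -> v * 10 + d) 0

allDigitsOdd :: Int -> Bool
allDigitsOdd = foldDigits (\a d -> a && odd d) True

isReversible :: Int -> Bool
isReversible n = n `rem` 10 /= 0 && allDigitsOdd (n + reverseNumber n)

main :: IO ()
main = print (length (filter isReversible [1 .. 999]))  -- 120, per the problem statement
```

Running this on the small range is a cheap way to confirm that the rewrite still agrees with the problem statement before timing the full run.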

The effect on computing time is tangible. For the original, I got

$ ./eul145 +RTS -s -N2
608720
1,814,289,579,592 bytes allocated in the heap
     196,407,088 bytes copied during GC
          47,184 bytes maximum residency (2 sample(s))
          30,640 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     1827331 colls, 1827331 par   23.77s   11.86s     0.0000s    0.0041s
  Gen  1         2 colls,     1 par    0.00s    0.00s     0.0001s    0.0001s

  Parallel GC work balance: 54.94% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  620.52s  (313.51s elapsed)
  GC      time   23.77s  ( 11.86s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  644.29s  (325.37s elapsed)

  Alloc rate    2,923,834,808 bytes per MUT second

(I used -N2 because my i5 has only two physical cores), vs.

$ ./eul145 +RTS -s -N2
608720
  16,000,063,624 bytes allocated in the heap
         403,384 bytes copied during GC
          47,184 bytes maximum residency (2 sample(s))
          30,640 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     15852 colls, 15852 par    0.34s    0.17s     0.0000s    0.0037s
  Gen  1         2 colls,     1 par    0.00s    0.00s     0.0001s    0.0001s

  Parallel GC work balance: 43.86% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  314.85s  (160.08s elapsed)
  GC      time    0.34s  (  0.17s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  315.20s  (160.25s elapsed)

  Alloc rate    50,817,657 bytes per MUT second

  Productivity  99.9% of total user, 196.5% of total elapsed

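with the modification. The running time is roughly halved (about 325 s down to 160 s elapsed), and allocation drops by a factor of about 100 (roughly 1.8 TB down to 16 GB).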