Haskell speed / memory usage

Date: 2015-02-28 10:49:34

Tags: haskell memory-management

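I am trying to process some point cloud data with Haskell, and it seems to use a huge amount of memory. The code I am using is below; it basically parses the data into a format I can work with. The dataset is 440 MB and has 10M lines. When I run it with runhaskell, it eats all the memory (~3-4 GB) within a short time and then crashes. If I compile it with -O2 and run it, it goes to 100% CPU and takes a long time to finish (about 3 minutes). I should mention that I am on an i7 CPU with 4 GB of RAM and an SSD, so there should be plenty of resources. How can I improve the performance?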

{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (lines, readFile)
import Data.Text.Lazy (Text, splitOn, unpack, lines)
import Data.Text.Lazy.IO (readFile)
import Data.Maybe (fromJust)
import Text.Read (readMaybe)

filename :: FilePath
filename = "sample.txt"

readTextMaybe = readMaybe . unpack

data Classification = Classification
    { id :: Int, description :: Text
    } deriving (Show)

data Point = Point
    { x :: Int, y :: Int, z :: Int, classification :: Classification
    } deriving (Show)

type PointCloud = [Point]

maybeReadPoint :: Text -> Maybe Point
maybeReadPoint text = parse $ splitOn "," text
    where toMaybePoint :: Maybe Int -> Maybe Int -> Maybe Int -> Maybe Int -> Text -> Maybe Point
          toMaybePoint (Just x) (Just y) (Just z) (Just cid) cdesc = Just (Point x y z (Classification cid cdesc))
          toMaybePoint _ _ _ _ _                                   = Nothing
          parse :: [Text] -> Maybe Point
          parse [x, y, z, cid, cdesc] = toMaybePoint (readTextMaybe x) (readTextMaybe y) (readTextMaybe z) (readTextMaybe cid) cdesc
          parse _                     = Nothing

readPointCloud :: Text -> PointCloud
readPointCloud = map (fromJust . maybeReadPoint) . lines

main = (readFile filename) >>= (putStrLn . show . sum . map x . readPointCloud)

1 answer:

Answer 0 (score: 3)

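The reason it uses up all the memory when run without optimization is most likely that sum is defined using foldl. The lazy foldl builds a chain of unevaluated (+) thunks, one per list element, before it adds anything. Without the strictness analysis that optimization brings, that blows up badly on 10M elements. You can try this function instead: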

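import Data.List (foldl')

-- Strict left fold: the accumulator is forced at every step, so no thunk chain builds up.
sum' :: Num n => [n] -> n
sum' = foldl' (+) 0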

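The reason it is so slow when compiled with optimization appears to be the way you parse the input. A cons cell is allocated for each character when the input is read, again when the input is broken into lines, and probably yet again when splitting on commas. Using a proper parsing library (any of them) will almost certainly help; using a streaming one such as pipes or conduit may be best (I'm not sure).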
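As a rough illustration of cutting per-character allocation, the sketch below reads the numeric fields directly from lazy Text with Data.Text.Lazy.Read instead of going through unpack and readMaybe. It is not one of the parsing or streaming libraries mentioned above, only a lighter-weight alternative from the text package. The names readInt and parsePoint are hypothetical, and a plain tuple stands in for the question's Point type.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Read as TR

-- Hypothetical helper: parse an entire field as a signed decimal Int.
-- Fails if the field is not a number or has trailing characters.
readInt :: TL.Text -> Maybe Int
readInt t = case TR.signed TR.decimal t of
    Right (n, rest) | TL.null rest -> Just n
    _                              -> Nothing

-- Hypothetical stand-in for maybeReadPoint: x, y, z, class id, class description.
parsePoint :: TL.Text -> Maybe (Int, Int, Int, Int, TL.Text)
parsePoint line = case TL.splitOn "," line of
    [x, y, z, c, d] ->
        (,,,,) <$> readInt x <*> readInt y <*> readInt z <*> readInt c <*> pure d
    _ -> Nothing

main :: IO ()
main = mapM_ (print . parsePoint)
    [ "1,2,3,4,ground"   -- well-formed line
    , "-7,0,12,1,roof"   -- negative coordinate
    , "1,2,three,4,bad"  -- non-numeric field
    , "1,2,3"            -- too few fields
    ]
```

Whether this is fast enough for the 440 MB file would need to be measured; it only removes the intermediate String that unpack creates for every field.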

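Another issue, unrelated to performance: fromJust is rather poor form in general, and a very bad idea when dealing with user input. You should instead mapM over the list in the Maybe monad, which will produce a Maybe [Point] for you.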
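A minimal sketch of that mapM suggestion, using a hypothetical parseField on plain Strings in place of maybeReadPoint so that the example stays self-contained:

```haskell
import Text.Read (readMaybe)

-- Hypothetical stand-in for maybeReadPoint: parse one field as an Int.
parseField :: String -> Maybe Int
parseField = readMaybe

-- mapM in the Maybe monad: Just the full list if every element parses,
-- Nothing as soon as any single element fails.
parseAll :: [String] -> Maybe [Int]
parseAll = mapM parseField

main :: IO ()
main = do
    print (parseAll ["1", "2", "3"])     -- prints: Just [1,2,3]
    print (parseAll ["1", "oops", "3"])  -- prints: Nothing
```

One trade-off to keep in mind: mapM in Maybe cannot return the outer Just until every element has been parsed, so the resulting list is not produced lazily. For a 10M-line file that may matter for memory use.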