I'm trying to process some point cloud data with Haskell, and it seems to use a huge amount of memory. The code I'm using is below; it basically parses the data into a format I can work with. The data set is 440MB with 10M lines. When I run it with runhaskell, it eats all available memory (~3-4GB) in a short time and then crashes. If I compile it with -O2 and run it, it goes to 100% CPU and takes a long time to finish (~3 minutes). I should mention that I'm on an i7 CPU with 4GB of RAM and an SSD, so there should be plenty of resources. How can I improve the performance?
{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (lines, readFile)
import Data.Text.Lazy (Text, splitOn, unpack, lines)
import Data.Text.Lazy.IO (readFile)
import Data.Maybe (fromJust)
import Text.Read (readMaybe)
filename :: FilePath
filename = "sample.txt"
readTextMaybe :: Read a => Text -> Maybe a
readTextMaybe = readMaybe . unpack
data Classification = Classification
{ id :: Int, description :: Text
} deriving (Show)
data Point = Point
{ x :: Int, y :: Int, z :: Int, classification :: Classification
} deriving (Show)
type PointCloud = [Point]
maybeReadPoint :: Text -> Maybe Point
maybeReadPoint text = parse $ splitOn "," text
  where
    toMaybePoint :: Maybe Int -> Maybe Int -> Maybe Int -> Maybe Int -> Text -> Maybe Point
    toMaybePoint (Just x) (Just y) (Just z) (Just cid) cdesc = Just (Point x y z (Classification cid cdesc))
    toMaybePoint _ _ _ _ _ = Nothing

    parse :: [Text] -> Maybe Point
    parse [x, y, z, cid, cdesc] = toMaybePoint (readTextMaybe x) (readTextMaybe y) (readTextMaybe z) (readTextMaybe cid) cdesc
    parse _ = Nothing
readPointCloud :: Text -> PointCloud
readPointCloud = map (fromJust . maybeReadPoint) . lines
main :: IO ()
main = readFile filename >>= print . sum . map x . readPointCloud
Answer 0 (score: 3)
The reason it uses all the memory when compiled without optimization is most likely that sum is defined using foldl. Without the strictness analysis that optimization brings in, the accumulator builds up a huge chain of unevaluated thunks. You can try using this function instead (it needs an extra import of foldl' from Data.List):

import Data.List (foldl')

sum' :: Num n => [n] -> n
sum' = foldl' (+) 0
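As a self-contained sketch (the list literal here is just a stand-in for the parsed coordinate values), the strict fold sums a large list in constant space even without -O2:

```haskell
import Data.List (foldl')

-- foldl' forces the accumulator at every step, so no thunk chain
-- builds up the way it does with the lazy foldl.
sum' :: Num n => [n] -> n
sum' = foldl' (+) 0

main :: IO ()
main = print (sum' [1 .. 1000000 :: Int])  -- prints 500000500000
```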
The reason it's slow when compiled with optimization seems to be related to the way the input is parsed. A cons cell is allocated for each character when the input is read in, again when it is split into lines, and probably again when splitting on commas. Using a proper parsing library (any of them) would almost certainly help; a streaming approach such as pipes or conduit might be best (I'm not sure).
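One way to avoid the per-character allocation, sketched here with strict ByteStrings from the bytestring package rather than a full parsing library (the field layout mirrors the question's "x,y,z,classId,description" format; parseLine is a name made up for this example):

```haskell
import qualified Data.ByteString.Char8 as B

-- readInt parses an Int prefix directly from the packed bytes,
-- with no intermediate String or per-character cons cells.
parseLine :: B.ByteString -> Maybe (Int, Int, Int, Int, B.ByteString)
parseLine line = do
  (x, r1)   <- B.readInt line
  (y, r2)   <- B.readInt (B.drop 1 r1)  -- drop the comma
  (z, r3)   <- B.readInt (B.drop 1 r2)
  (cid, r4) <- B.readInt (B.drop 1 r3)
  return (x, y, z, cid, B.drop 1 r4)

main :: IO ()
main = print (parseLine (B.pack "1,2,3,4,ground"))
```

The Maybe monad short-circuits the whole parse as soon as any readInt fails, which keeps the error handling the same as in the original maybeReadPoint.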
Another issue, unrelated to performance: fromJust is rather poor form in general, and a really bad idea when dealing with user input. You should instead mapM over the list in the Maybe monad, which will produce a Maybe [Point] for you.
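A minimal sketch of that idea, with readMaybe standing in for the question's maybeReadPoint (readAll is a name made up for this example):

```haskell
import Text.Read (readMaybe)

-- mapM in the Maybe monad turns a parser for one element into a
-- parser for the whole list: it yields Nothing if any element fails,
-- instead of crashing the way fromJust does.
readAll :: [String] -> Maybe [Int]
readAll = mapM readMaybe

main :: IO ()
main = do
  print (readAll ["1", "2", "3"])  -- Just [1,2,3]
  print (readAll ["1", "oops"])    -- Nothing
```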