Trying to track down a Haskell memory leak

Time: 2013-09-20 18:49:10

Tags: haskell nlp

So I'm trying to break a corpus of 40,000 articles down into tf-idf weights for every word in them. I have about 300MB of reviews. However, when I try to analyze even a small subset of those reviews (~1000), memory consumption is enormous: processing 1000 reviews takes roughly 600MB. This is unacceptable.
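For context, the weight I'm after is plain tf-idf. A minimal sketch (the function and its argument names are purely illustrative, not part of my code):

-- Illustrative only: the tf-idf weight of a term, given its count in one
-- document, the document's length, the number of documents containing the
-- term, and the total number of documents in the corpus.
tfIdf :: Int -> Int -> Int -> Int -> Double
tfIdf termCount docLength docFreq numDocs = tf * idf
  where tf  = fromIntegral termCount / fromIntegral docLength
        idf = log (fromIntegral numDocs / fromIntegral (max 1 docFreq))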

Heap profiling shows, as expected, that nearly all of the memory (~550MB) is allocated to ByteStrings. That seems high, given that the first 1000 reviews only amount to about 50MB, and I'm not even retaining the full text bodies of the reviews. I've already tried adding strictness (which usually fixes this kind of problem), but it didn't seem to help. I've also tried a linear hash table instead of a basic hash table, with the same result.
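By "adding strictness" I mean bang patterns and forcing values before they go into the table. A minimal sketch of the kind of thing I tried (the helper is made up; it assumes H is Data.HashTable.IO and BL is Data.ByteString.Lazy, as in the code below):

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy as BL
import qualified Data.HashTable.IO as H

-- Sketch only: force the key and value to WHNF before inserting, so the
-- table doesn't hold on to unevaluated thunks.
strictInsert :: H.BasicHashTable BL.ByteString Int -> BL.ByteString -> Int -> IO ()
strictInsert h !k !v = H.insert h k v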

I suspect there is some problem with the foldM reduction. Most of the time/alloc is spent in the extractReview logic, but I can't see any obvious culprit.
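If the foldM reduction really is the issue, one thing to rule out is thunks building up in the fold state, e.g. with a strict variant along these lines (just a sketch, not in the code below; note it only forces the accumulator to WHNF, so it wouldn't reach the ByteStrings themselves):

{-# LANGUAGE BangPatterns #-}
-- Sketch of a foldM that forces the accumulator at each step.
foldM' :: Monad m => (a -> b -> m a) -> a -> [b] -> m a
foldM' _ !acc []     = return acc
foldM' f !acc (x:xs) = do acc' <- f acc x
                          foldM' f acc' xs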

Any help would be greatly appreciated.

Relevant code (some helper functions omitted):

processReview :: Int -> [Review] -> String -> IO [Review]
processReview n stack file = do !raw <- B.readFile file
                                !newr <- extractReview n raw
                                return $ newr : stack

extractReview :: Int -> B.ByteString -> IO Review
extractReview n  r = do  !new_ngrams <- count_ngrams n body
                         return $ Review {ngrams = new_ngrams, url = safeNode url, isbns = map strContent isbns} 
                     where (Just !elem) = parseXMLDoc r
                           !body = cleanUTF8 $ B8.pack $ safeNode $ findElement (QName "body" Nothing Nothing) elem
                           !isbns = findElements (QName "isbn" Nothing Nothing) elem
                           !url = findElement (QName "url" Nothing Nothing) elem
                           safeNode = maybe "" (\m -> strContent m)

count_ngrams :: Int -> BL.ByteString -> IO Ngrams
count_ngrams n rbody = do !new_list <- H.new
                          !ngrams <- foldM (\h w -> let !w' = lowercase w
                                                    in if elem w' ignore_words
                                                          then return h
                                                          else increment_ngram 1 h w')
                                           new_list word_list
                          return ngrams
                        where !just_words = BL.filter (\c -> c == 32 || (c >= 65 && c <= 90) || (c >= 97 && c <= 122)) (rbody)
                              !word_list = BL.split 32 just_words

increment_ngram :: Int -> Ngrams -> BL.ByteString -> IO Ngrams
increment_ngram amount ns word = do count <- H.lookup ns word
                                    case count of
                                         (Just i) -> H.insert ns word (i + amount)
                                         Nothing -> H.insert ns word amount
                                    return ns

sumNgrams :: [Review] -> IO Ngrams
sumNgrams reviews = do dict <- H.new
                       mapM_ (\r -> H.mapM_ (\(k,v) -> increment_ngram 1 dict k) (ngrams r)) reviews 
                       return dict                        


main = do
       [n] <- getArgs
       ngrams <- H.new :: IO (H.BasicHashTable Review Ngrams)
       reviews <- fmap (map (\c -> "./reviews/" ++ c) . filter (isInfixOf "xml") . take 500) $ getDirectoryContents "./reviews"
       analyzed_reviews <- foldM (\stack r -> processReview (read n) stack r) [] reviews

0 Answers:

No answers yet.