I've been trying out Haskell and enjoying the experience very much, but I'm evaluating it for a real project with some fairly hard performance requirements. The first pass of my task is to process a complete (no-history) dump of Wikipedia (bzip2-compressed), about 6 GB compressed in total. In Python, a script doing a full extraction of each raw page (about 10 million in total) takes about 30 minutes on my box (for reference, a Scala implementation using a pull parser takes about 40 minutes). I've been attempting to replicate this performance with Haskell and GHC, and have been struggling to match it.
I've been using Codec.Compression.BZip for decompression and hexpat for parsing. I use lazy bytestrings as the input to hexpat and strict bytestrings for the element text type. To extract the text of each page, I build up a DList of pointers to the text elements and then iterate over it to dump it to stdout. The code I've just described has already been through a number of profiling/refactoring iterations (I moved quickly from String to ByteString, then from string concatenation to lists of pointers to the text, then to a DList of pointers to the text). I think I've gained about two orders of magnitude of speedup over my original code, but it still takes over an hour and a half to parse (although it has a lovely small memory footprint). So I'm looking for some inspiration from the community to push me further. The code is below (I've broken it up into a number of sub-functions in order to get more detail out of the profiler). Please excuse my Haskell: I've only been coding for a few days (having spent a week with Real World Haskell). And thanks in advance!
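In case DList is unfamiliar: it represents a list as a function, so appending is O(1) rather than the O(n) you pay for left-nested (++) chains. A tiny standalone illustration (a hypothetical sketch, not part of the program below):

import Data.DList (DList)
import qualified Data.DList as DList

-- A difference list: append just composes functions, so building up
-- output from many small chunks stays linear overall.
chunks :: DList Char
chunks = DList.fromList "foo" `DList.append` DList.fromList "bar"

main :: IO ()
main = putStrLn (DList.toList chunks)  -- prints "foobar"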
import System.Exit
import Data.Maybe
import Data.List
import Data.DList (DList)
import qualified Data.DList as DList
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as LazyByteString
import qualified Codec.Compression.BZip as BZip
import Text.XML.Expat.Proc
import Text.XML.Expat.Tree
import Text.XML.Expat.Format
testFile = "../data/enwiki-latest-pages-articles.xml.bz2"
validPage pageData = case pageData of
    (Just _, Just _) -> True
    (_, _) -> False

scanChildren :: [UNode ByteString] -> DList ByteString
scanChildren c = case c of
    h:t -> DList.append (getContent h) (scanChildren t)
    []  -> DList.fromList []

getContent :: UNode ByteString -> DList ByteString
getContent treeElement =
    case treeElement of
        (Element name attributes children) -> scanChildren children
        (Text text) -> DList.fromList [text]

rawData t = ((getContent . fromJust . fst) t, (getContent . fromJust . snd) t)

extractText page = do
    revision <- findChild (BS.pack "revision") page
    text <- findChild (BS.pack "text") revision
    return text

pageDetails tree =
    let pageNodes = filterChildren relevantChildren tree in
    let getPageData page = (findChild (BS.pack "title") page, extractText page) in
    map rawData $ filter validPage $ map getPageData pageNodes
    where
        relevantChildren node = case node of
            (Element name attributes children) -> name == (BS.pack "page")
            (Text _) -> False

outputPages pagesText = do
    let flattenedPages = map DList.toList pagesText
    mapM_ (mapM_ BS.putStr) flattenedPages
readCompressed fileName = fmap BZip.decompress (LazyByteString.readFile fileName)
parseXml byteStream = parse defaultParseOptions byteStream :: (UNode ByteString, Maybe XMLParseError)
main = do
    rawContent <- readCompressed testFile
    let (tree, mErr) = parseXml rawContent
    let pages = pageDetails tree
    let pagesText = map snd pages
    outputPages pagesText
    putStrLn "Complete!"
    exitWith ExitSuccess
Answer 0 (score: 5)
After running the program, I get somewhat odd results:
./wikiparse +RTS -s -A5m -H5m | tail
./wikiparse +RTS -s -A5m -H5m
3,604,204,828,592 bytes allocated in the heap
   70,746,561,168 bytes copied during GC
       39,505,112 bytes maximum residency (37822 sample(s))
        2,564,716 bytes maximum slop
               83 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 620343 collections,     0 parallel, 15.84s, 368.69s elapsed
  Generation 1:  37822 collections,     0 parallel,  1.08s,  33.08s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time  243.85s  (4003.81s elapsed)
  GC    time   16.92s  ( 401.77s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  260.77s  (4405.58s elapsed)

  %GC time       6.5%  (9.1% elapsed)

  Alloc rate    14,780,341,336 bytes per MUT second

  Productivity  93.5% of total user, 5.5% of total elapsed
I think the total time is more than OK: 260s of CPU time is way faster than the 30 minutes for Python. What I can't explain is why the wall-clock time is so large: the MUT time is only 243s, yet over 4000s elapsed. I really don't think that reading a 6 GB file should take more than an hour to complete.
I'm running your program again to check whether the results are consistent.
If those 4'20'' results are right, then I believe something is wrong with the machine... or there is some other strange effect here.
The code was compiled on GHC 7.0.2.
Edit: I've tried various versions of this program. The most important optimization seems to be the {-# INLINE #-} pragma and specialization of functions. Some of them have rather generic types, which is known to be bad for performance. OTOH I believe inlining should be enough to trigger the specialization, so you should try to experiment further with this.
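To illustrate the point (a hypothetical sketch, not code taken from the program above): a function with a generic type can be pinned to the concrete type it is actually used at with a SPECIALIZE pragma, while INLINE exposes it for specialization at each call site:

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Char8 (ByteString)

-- Generic type: every call pays for an Eq dictionary lookup unless
-- GHC specializes the function to the concrete type.
{-# SPECIALIZE matches :: ByteString -> ByteString -> Bool #-}
{-# INLINE matches #-}
matches :: Eq a => a -> a -> Bool
matches tag el = tag == el

main :: IO ()
main = print (matches ("page" :: ByteString) "page")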
I didn't see any significant differences between the GHC versions I tried (6.12 .. HEAD).
Haskell's bindings to bzlib seem to have optimal performance. The following program, a near-complete reimplementation of the standard bzcat program, is as fast as or even faster than the original.
module Main where
import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip
import System.Environment (getArgs)
readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)
main :: IO ()
main = do
    files <- getArgs
    mapM_ (\f -> readCompressed f >>= BSL.putStr) files
On my machine it takes about 1100 seconds to decompress the test file to /dev/null. The fastest version I was able to get is based on a SAX-style parser. I'm not sure whether its output matches that of the original: on small inputs the output is identical, and so is the performance. On the original file the SAX version is noticeably faster and finishes in ~2400 seconds. You can find it below.
{-# LANGUAGE OverloadedStrings #-}
import System.Exit
import Data.Maybe
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip
import System.IO
import Text.XML.Expat.SAX as SAX
type ByteStringL = BSL.ByteString
type Token = ByteString
type TokenParser = [SAXEvent Token Token] -> [[Token]]
testFile = "/tmp/enwiki-latest-pages-articles.xml.bz2"
readCompressed :: FilePath -> IO ByteStringL
readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)
{-# INLINE pageStart #-}
pageStart :: TokenParser
pageStart ((StartElement "page" _):xs) = titleStart xs
pageStart (_:xs) = pageStart xs
pageStart [] = []
{-# INLINE titleStart #-}
titleStart :: TokenParser
titleStart ((StartElement "title" _):xs) = finish "title" revisionStart xs
titleStart ((EndElement "page"):xs) = pageStart xs
titleStart (_:xs) = titleStart xs
titleStart [] = error "could not find <title>"
{-# INLINE revisionStart #-}
revisionStart :: TokenParser
revisionStart ((StartElement "revision" _):xs) = textStart xs
revisionStart ((EndElement "page"):xs) = pageStart xs
revisionStart (_:xs) = revisionStart xs
revisionStart [] = error "could not find <revision>"
{-# INLINE textStart #-}
textStart :: TokenParser
textStart ((StartElement "text" _):xs) = textNode [] xs
textStart ((EndElement "page"):xs) = pageStart xs
textStart (_:xs) = textStart xs
textStart [] = error "could not find <text>"
{-# INLINE textNode #-}
textNode :: [Token] -> TokenParser
textNode acc ((CharacterData txt):xs) = textNode (txt:acc) xs
textNode acc xs = (reverse acc) : textEnd xs
{-# INLINE textEnd #-}
textEnd {- , revisionEnd, pageEnd -} :: TokenParser
textEnd = finish "text" . finish "revision" . finish "page" $ pageStart
--revisionEnd = finish "revision" pageEnd
--pageEnd = finish "page" pageStart
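-- Composing 'finish' consumes the </text>, </revision> and </page> closing
-- tags in order, then resumes scanning for the next <page>.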
{-# INLINE finish #-}
finish :: Token -> TokenParser -> TokenParser
finish tag cont ((EndElement el):xs) | el == tag = cont xs
finish tag cont (_:xs) = finish tag cont xs
finish tag _ [] = error (show (tag,("finish []" :: String)))
main :: IO ()
main = do
    rawContent <- readCompressed testFile
    let parsed = pageStart (SAX.parse defaultParseOptions rawContent)
    mapM_ (mapM_ BS.putStr) ({- take 5000 -} parsed) -- remove comment to finish early
    putStrLn "Complete!"
Generally, I'm suspicious that the Python and Scala versions finish early without doing the full job. I couldn't verify that claim without their source code, though.
To sum up: inlining and specialization should give a reasonable performance boost, of roughly 2x.