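I want to extract information from a large XML file (about 20 GB) in Haskell. Because the file is so large, I used the SAX parsing functions from hexpat.
Here is the simple code I tested: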
import qualified Data.ByteString.Lazy as L
import Data.List (foldl')
import Data.Text (Text)
import Text.XML.Expat.SAX as Sax

-- processEvent is the user-defined fold step; its definition is not shown here.
parse :: FilePath -> IO ()
parse path = do
  inputText <- L.readFile path
  let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
  let txt = foldl' processEvent "" saxEvents
  putStrLn txt
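After enabling profiling in Cabal, the report says that parse.saxEvents accounts for 85% of the allocated memory. I also tried foldr, and the result was the same.
If processEvent becomes complex enough, the program crashes with a stack space overflow error.
What am I doing wrong?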
Answer 0 (score: 2)
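You don't say what processEvent looks like. In principle, it ought to be unproblematic to use a lazy ByteString for a strict left fold over lazily generated input, so I'm not sure what is going wrong in your case. But when dealing with gigantic files, one ought to use streaming-appropriate types!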
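Since processEvent isn't shown, the following is only a guess at one common cause: foldl' evaluates the accumulator to weak head normal form only, so an accumulator with lazy fields (a tuple, or a record without strict fields) can still pile up thunks that blow the stack when they are finally forced. Here is a minimal sketch of a fold whose accumulator is fully evaluated at every step. The Event and Stats types and the names step, summarize and events are hypothetical stand-ins (Event is not hexpat's SAXEvent), chosen so that the example needs only base.

```haskell
module Main where

import Data.List (foldl')

-- Hypothetical stand-in for hexpat's SAXEvent, so the sketch needs only base.
data Event
  = Start String   -- an opening tag
  | End String     -- a closing tag
  | Chars String   -- character data
  deriving (Show, Eq)

-- Accumulator with strict fields: evaluating a Stats value to weak head
-- normal form also evaluates both counters, so no thunks are left inside it.
data Stats = Stats
  { startTags :: !Int
  , textChars :: !Int
  } deriving (Show, Eq)

-- One step of the fold.
step :: Stats -> Event -> Stats
step (Stats s c) (Start _) = Stats (s + 1) c
step (Stats s c) (Chars t) = Stats s (c + length t)
step st          (End _)   = st

-- Strict left fold over a lazily produced list of events.
summarize :: [Event] -> Stats
summarize = foldl' step (Stats 0 0)

-- A lazily generated input with n elements, for experimenting.
events :: Int -> [Event]
events n = concatMap (\i -> [Start "item", Chars (show i), End "item"]) [1 .. n]

main :: IO ()
main = print (summarize (events 1000000))
```

If the real processEvent returns a tuple or a record with lazy fields, making those fields strict, or forcing them with seq in each step, is worth trying before anything else.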
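In fact, hexpat does have a 'streaming' interface (just like xml-conduit). It uses the not-too-well-known List library and the rather ugly List class it defines. In principle, the ListT type from the List package should work fine. I gave up on it quickly because of a lack of combinators, and instead wrote an appropriate instance of the ugly List class for a wrapped version of Pipes.ListT, which I then used to export ordinary Pipes.Producer functions like parseProducer. The trivial manipulations needed for this are appended below as PipesSax.hs.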
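Once we have parseProducer, we can convert a ByteString or Text Producer into a Producer of SAXEvents with Text or ByteString components. Here are some simple operations. I was using a 238 MB "input.xml"; the programs never need more than 6 MB of memory, judging from top.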
-- Sax.hs
Most of the IO actions use a registryIds pipe defined at the bottom, which is tailored to a giant bit of xml of which this is a valid 1000 fragment: http://sprunge.us/WaQK

{-# LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX
import Pipes -- cabal install pipes pipes-bytestring
import Pipes.ByteString ( toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8

sax :: MonadIO m => Producer ByteString m ()
                 -> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions

-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print

-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout

-- write all IDs to a file
main2 =
  IO.withFile "input.xml" IO.ReadMode $ \inp ->
  IO.withFile "output.txt" IO.WriteMode $ \out ->
    runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out

-- folds:
-- print number of IDs
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
  do n <- P.length $ sax (fromHandle inp) >-> registryIds
     print n

-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
  do let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
     n <- P.fold (+) 0 id pipeline
     print n
  where
    readIntId :: ByteString -> Integer
    readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2

-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
  e <- await -- we look for a 'SAXEvent'
  case e of  -- if it matches, we yield, else we go to the next event
    StartElement "FacilitySite" [("registryId",a)] -> do yield a
                                                         yield "\n"
                                                         registryIds
    _ -> registryIds

-- 'library': PipesSax.hs
This just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT, but just use the standard Pipes.Producer concept.
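To be clear, what follows is not the PipesSax.hs module. It is a separate, minimal sketch of the await/yield filtering pattern used by registryIds, run over an in-memory list so it can be tried with only the pipes package installed. The Event type, attrValues, sample and ids are all hypothetical names, Event only imitates the shape of hexpat's SAXEvent, and the second ID in the sample data is made up.

```haskell
module Main where

import Pipes
import qualified Pipes.Prelude as P

-- Hypothetical stand-in for hexpat's SAXEvent; not the library's type.
data Event
  = Start String [(String, String)]  -- tag name and attributes
  | End String
  | Chars String
  deriving (Show, Eq)

-- Same idea as registryIds: inspect each event and yield only on a match.
-- Using lookup instead of the exact pattern [("registryId", a)] also
-- matches elements that carry additional attributes.
attrValues :: Monad m => String -> String -> Pipe Event String m r
attrValues tag key = for cat $ \e -> case e of
  Start t attrs | t == tag -> maybe (return ()) yield (lookup key attrs)
  _                        -> return ()

-- Made-up sample data; only the first ID appears in the original example.
sample :: [Event]
sample =
  [ Start "FacilitySite" [("registryId", "110007915364")]
  , Chars "some text"
  , End "FacilitySite"
  , Start "FacilitySite" [("registryId", "110000000001"), ("status", "active")]
  , End "FacilitySite"
  , Start "Other" [("registryId", "ignored")]
  ]

-- Collect the matches purely; fine for a small test, not for huge inputs.
ids :: [String]
ids = P.toList (each sample >-> attrValues "FacilitySite" "registryId")

main :: IO ()
main = do
  mapM_ putStrLn ids
  -- The same pipeline consumed by a fold, as in main3 above.
  n <- P.length (each sample >-> attrValues "FacilitySite" "registryId")
  print n
```

With a real parse, each sample would be replaced by the Producer coming out of parseProducer, and the rest of the pipeline would stay the same shape.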