Question

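I want to extract information from a large XML file (about 20 GB) in Haskell. Because the file is so large, I used the SAX parsing functions from hexpat.

Here is a simple piece of code I tested: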

import qualified Data.ByteString.Lazy as L
import Data.List (foldl')
import Data.Text (Text)
import Text.XML.Expat.SAX as Sax

parse :: FilePath -> IO ()
parse path = do
    inputText <- L.readFile path
    let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
    let txt = foldl' processEvent "" saxEvents
    putStrLn txt

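After enabling profiling in Cabal, the report says that parse.saxEvents accounts for 85% of the allocated memory. I also tried foldr, with the same result.

If processEvent becomes complex enough, the program crashes with a stack space overflow error.

What am I doing wrong?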

Answer 1

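You don't say what processEvent looks like. In principle, using a lazy ByteString for a strict left fold over lazily generated input should be unproblematic, so I'm not sure what is going wrong in your case. But when dealing with gigantic files, one ought to use types that are suited to streaming!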
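Since processEvent is not shown, the following is only a guess at one common cause of the symptoms described. foldl' forces the accumulator only to weak head normal form at each step, so a lazily built String or tuple can still pile up thunks. A minimal sketch of a fold whose accumulator is fully evaluated at each step, using a hypothetical Event type as a stand-in for hexpat's SAXEvent (the names Event, Stats, step, and summarize are assumptions, not part of the question):

```haskell
import Data.List (foldl')

-- Hypothetical stand-in for hexpat's SAXEvent; only the shape matters here.
data Event = Start String | End String | Chars String

-- Strict fields: forcing a Stats value to WHNF also forces both counters,
-- so foldl' cannot leave unevaluated additions behind.
data Stats = Stats !Int !Int

-- Count start tags and total character-data length.
step :: Stats -> Event -> Stats
step (Stats s c) (Start _) = Stats (s + 1) c
step (Stats s c) (Chars t) = Stats s (c + length t)
step acc         (End _)   = acc

summarize :: [Event] -> (Int, Int)
summarize evs = case foldl' step (Stats 0 0) evs of
  Stats s c -> (s, c)

main :: IO ()
main = print (summarize events)
  where
    -- A lazily generated input list, consumed one element at a time.
    events = concatMap (\i -> [Start "a", Chars (show i), End "a"])
                       [1 .. 100000 :: Int]
```

With a lazy tuple such as (Int, Int) in place of Stats, the same fold would typically build a chain of suspended additions that is only evaluated at the end.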

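In fact, hexpat does have a 'streaming' interface (just like xml-conduit). It uses the not-too-well-known List library and the rather ugly List class it defines. In principle, the ListT type from the List package should work fine. I gave up on it quickly because of a lack of combinators, and instead wrote an appropriate instance of the ugly List class for a wrapped version of Pipes.ListT, which I then used to export ordinary Pipes.Producer functions like parseProducer. The trivial manipulations needed for this are appended below as PipesSax.hs.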

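Once we have parseProducer, we can convert a ByteString or Text Producer into a Producer of SAXEvents with Text or ByteString components. Here are some simple operations. I was using a 238 MB "input.xml"; the programs never need more than 6 MB of memory, to judge from looking at top.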
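To try the streaming shape without hexpat or PipesSax.hs, here is a small sketch that needs only the pipes package. The Event type, fakeEvents, and attrValues are hypothetical stand-ins (they are not part of hexpat or of the answer's code); a real program would replace fakeEvents with the Producer returned by the parser.

```haskell
import Pipes
import qualified Pipes.Prelude as P

-- Hypothetical stand-in for hexpat's SAXEvent, so the sketch needs only pipes.
data Event
  = StartElement String [(String, String)]
  | EndElement String
  | CharacterData String

-- Yield the value of attribute `attr` on every start tag named `tag`;
-- all other events are dropped.
attrValues :: Monad m => String -> String -> Pipe Event String m ()
attrValues tag attr = for cat $ \e -> case e of
  StartElement t attrs | t == tag -> mapM_ yield (lookup attr attrs)
  _                               -> return ()

-- Stand-in for a parsed event stream; events are produced on demand.
fakeEvents :: Monad m => Int -> Producer Event m ()
fakeEvents n = for (each [1 .. n]) $ \i -> do
  yield (StartElement "FacilitySite" [("registryId", show i)])
  yield (CharacterData "text")
  yield (EndElement "FacilitySite")

main :: IO ()
main = do
  -- P.length is a strict fold: each event is consumed and then discarded.
  n <- P.length (fakeEvents 1000000 >-> attrValues "FacilitySite" "registryId")
  print n
```

Because each event is handed downstream as soon as it is produced, nothing in this pipeline needs to hold the whole stream at once.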

-- Sax.hs

Most of the IO actions use the registryIds pipe defined at the bottom, which is tailored to a giant bit of XML of which this is a valid 1000 fragment: http://sprunge.us/WaQK

{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX
import Pipes  -- cabal install pipes pipes-bytestring
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8

sax :: MonadIO m => Producer ByteString m ()
                 -> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions

-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print

-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout

-- write all IDs to a file
main2 =
  IO.withFile "input.xml" IO.ReadMode $ \inp ->
  IO.withFile "output.txt" IO.WriteMode $ \out ->
    runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out

-- folds:
-- print number of IDs
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
  do n <- P.length $ sax (fromHandle inp) >-> registryIds
     print n

-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
  do let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
     n <- P.fold (+) 0 id pipeline
     print n
  where
    readIntId :: ByteString -> Integer
    readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2

-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
  e <- await -- we look for a 'SAXEvent'
  case e of -- if it matches, we yield, else we go to the next event
    StartElement "FacilitySite" [("registryId",a)] -> do yield a
                                                         yield "\n"
                                                         registryIds
    _ -> registryIds

-- 'library': PipesSax.hs

This just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT, but just use the standard Pipes.Producer concept.
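For readers unfamiliar with the relationship between Pipes.ListT and Producer, here is a brief sketch. It is not the author's PipesSax.hs; it only shows that in the pipes library ListT is a newtype around Producer, with Select wrapping a Producer and enumerate unwrapping it, which is what makes it possible to hide ListT behind a Producer-only interface. The name pairs is made up for the illustration.

```haskell
import Pipes
import qualified Pipes.Prelude as P

-- Wrap two Producers as ListT values and combine them with the ListT monad,
-- which behaves like a list comprehension over effectful streams.
pairs :: Monad m => ListT m (Int, Char)
pairs = do
  n <- Select (each [1, 2])
  c <- Select (each "ab")
  return (n, c)

main :: IO ()
main =
  -- enumerate turns the ListT back into an ordinary Producer.
  runEffect $ enumerate pairs >-> P.print
```

Code that consumes the result only ever sees a Producer, so ListT can remain an internal detail of a wrapper module.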

(The PipesSax.hs listing is not included here.)

How to parse a large XML file in Haskell with limited resources?
