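How do I parse into records?

Date: 2015-04-01 01:41:52

Tags: haskell

I asked an earlier question, and this one grew out of it. I found that the implementation produced a list of strings rather than a list of records. The file I am parsing contains records like these: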

  

sp|P30375|1A01_GORGO Class I histocompatibility antigen Gogo-A*0101 alpha chain OS=Gorilla gorilla gorilla PE=2 SV=1
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFSTSVSRPGRGEPRFIAVGYVDDTQFVRF
DSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRVDLGTLRGYYNQSEDGSHTIQ
RMYGCDVGSDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAEITKRKWEAAHFAEQL
RAYLEGTCVEWLRRHLENGKETLQRTDAPKTHMTHHAVSDHEAILRCWALSFYPAEITLT
WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPEPLTLRWEP
SSQPTIPIVGIIAGLVLFGAVIAGAVVAAVRWRRKSSDRKGGSYSQAASSDSAQGSDVSL
TACKV
sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen A-1 alpha chain OS=Homo sapiens GN=HLA-A PE=1 SV=1
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRF
DSDAASQKMEPRAPWIEQEGPEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQ
IMYGCDVGPDGRFLRGYRQDAYDGKDYIALNEDLRSWTAADMAAQITKRKWEAVHAAEQR
RVYLEGRCVDGLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLT
WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEL
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSL
TACKV

There is a '>' before each "sp", which I plan to use as the record split point. So how can I end up with:

[[>sp|P30375|1A01_GORGO Class I histocompatibility antigen Gogo-A*0101 alpha chain OS=Gorilla gorilla gorilla PE=2 SV=1
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFSTSVSRPGRGEPRFIAVGYVDDTQFVRF
DSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRVDLGTLRGYYNQSEDGSHTIQ
RMYGCDVGSDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAEITKRKWEAAHFAEQL
RAYLEGTCVEWLRRHLENGKETLQRTDAPKTHMTHHAVSDHEAILRCWALSFYPAEITLT
WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPEPLTLRWEP
SSQPTIPIVGIIAGLVLFGAVIAGAVVAAVRWRRKSSDRKGGSYSQAASSDSAQGSDVSL
TACKV]
[>sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen A-1 alpha chain OS=Homo sapiens GN=HLA-A PE=1 SV=1
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRF
DSDAASQKMEPRAPWIEQEGPEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQ
IMYGCDVGPDGRFLRGYRQDAYDGKDYIALNEDLRSWTAADMAAQITKRKWEAVHAAEQR
RVYLEGRCVDGLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLT
WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEL
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSL
TACKV]]

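using parsec? This is the code I started with: how to parse a uniprot-file with parsec?

1 answer:

Answer 0 (score: 3)

As I understand your question, you just need to parse records delimited by '>'. Each record is then a string containing every character except '>'. You are looking for something like this: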

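import Control.Applicative ((*>))
import Text.Parsec
import Text.Parsec.ByteString (Parser, parseFromFile)

-- A record is kept as the raw text between two '>' markers.
type Record = String

-- Read the file and return its records; abort with the parse error on failure.
parserFile :: FilePath -> IO [Record]
parserFile fileName = do
    r <- parseFromFile parseRecords fileName
    case r of
        Left  msg -> error (show msg)
        Right xs  -> return xs

-- One or more records: a '>' followed by every character up to the next '>'.
parseRecords :: Parser [Record]
parseRecords = many1 (char '>' *> many1 (noneOf ">"))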

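The "parseFromFile" function reads the data using an efficient binary representation (a ByteString) and takes the parser as its other argument, running it over the byte string read from the file.

Now, all of your records start with the '>' symbol, so you only need a parser that matches the '>' at the start and stores the remaining characters in a list until the next '>' symbol.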
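If a structured record is wanted rather than one raw string per entry, the same idea can be extended a little. The following is only a sketch under some assumptions: the FastaRecord type, its field names, and the helper names fastaRecord, fastaRecords and parseFasta are hypothetical and not part of the answer above, and it works on an in-memory String (via Text.Parsec.String) rather than a ByteString so it is easy to try in GHCi. It treats the first line after '>' as the header and joins the remaining lines into one sequence.

```haskell
import Control.Applicative ((<*))
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical record type, introduced for illustration only.
data FastaRecord = FastaRecord
  { recHeader   :: String  -- text after '>' up to the end of that line
  , recSequence :: String  -- remaining lines, joined without line breaks
  } deriving (Show, Eq)

-- One record: '>', a header line, then everything up to the next '>'.
-- Assumption: '>' never occurs inside the sequence lines.
fastaRecord :: Parser FastaRecord
fastaRecord = do
  _      <- char '>'
  header <- many (noneOf "\r\n")
  body   <- many (noneOf ">")
  -- Drop line breaks so the sequence becomes one contiguous string.
  return (FastaRecord header (filter (`notElem` "\r\n") body))

-- One or more records, then end of input.
fastaRecords :: Parser [FastaRecord]
fastaRecords = many1 fastaRecord <* eof

-- Run the parser on an in-memory string.
parseFasta :: String -> Either ParseError [FastaRecord]
parseFasta = parse fastaRecords "<input>"
```

For example, parseFasta ">id1 desc\nAAA\nBBB\n>id2\nCCC\n" yields two records, with headers "id1 desc" and "id2" and sequences "AAABBB" and "CCC". Input that does not begin with '>' is rejected with a ParseError.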