Parsing with Ruby

Time: 2015-04-16 04:00:29

Tags: ruby parsing xml-parsing

I'm new to Ruby. For a school project I'm parsing an XML file and need to extract the data that follows certain tags. I can only use core Ruby, no gems.

    pFile = File.open("myfile.mzML", "r")
    regmsLvl = "ms level\" value=\""

    pFile.each_line { |line|
      scn = line.scan(/#{regmsLvl}(\d)/)

      # what I want to do, but it doesn't work
      if scn == 1
        puts("Got it!")
      end

      # what I have to do to compare if == 1
      if scn != nil
        scn.each do |val|
          if val[0].to_i == 1
            puts("Got it!")
          end
        end
      end
    }

    # a sample line that I am parsing is:
    # <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1" />

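This feels clumsy. The output of line.scan makes scn a two-dimensional array. How can I get a string that is overwritten on each pass instead? Or how should I restructure the whole thing? Any advice is appreciated. puts(scn) prints 1, but if I test scn == 1 or scn.to_i == 1 it never enters the if. I have tried scn.pop and scn.pop.pop.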

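I've added a section showing what I'm trying to do now.

I need to check the ms level: if it is 1, get the scan start time and then the binary. This is the code I'm using.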

require 'rexml/document'
include REXML

xmlfile = File.new("afile.mzML")
xmldoc = Document.new(xmlfile)

root = xmldoc.root
puts "Root element : " + root.attributes["xmlns"]

xmldoc.elements.each("mzML/run/spectrumList/spectrum/cvParam") { |e|
  if e.attributes["value"].to_i == 1
    # Now I need to get the start time at:
    #   "mzML/run/spectrumList/spectrum/cvParam/scanList/scan/value"
    # and then the binary at:
    #   "mzML/run/spectrumList/spectrum/cvParam/binaryDataArrayList/binaryDataArray/binary"
  end
}

<run id="ru_0" defaultInstrumentConfigurationRef="ic_0" sampleRef="sa_0" defaultSourceFileRef="sf_ru_0">
    <spectrumList count="3310" defaultDataProcessingRef="dp_sp_0">
        <spectrum id="scan=8839" index="0" defaultArrayLength="171" dataProcessingRef="dp_sp_0">
            <cvParam cvRef="MS" accession="MS:1000525" name="spectrum representation" />
            <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1" />
            <cvParam cvRef="MS" accession="MS:1000294" name="mass spectrum" />
            <cvParam cvRef="MS" accession="MS:1000130" name="positive scan" />
            <scanList count="1">
                <cvParam cvRef="MS" accession="MS:1000795" name="no combination" />
                <scan>
                    <cvParam cvRef="MS" accession="MS:1000016" name="scan start time" value="5429.47" unitAccession="UO:0000010" unitName="second" unitCvRef="UO" />
                </scan>
            </scanList>
            <binaryDataArrayList count="2">
                <binaryDataArray encodedLength="1824">
                    <cvParam cvRef="MS" accession="MS:1000514" name="m/z array" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
                    <cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" />
                    <cvParam cvRef="MS" accession="MS:1000576" name="no compression" />
                    <binary>AAAAQBCdgkAAAACAP6KCQAAAAAA8pIJAAAAAYAWlgkAAAABgQ6aCQAAAAGCzp4JAAAAAQEaogkAAAACgDKqCQAAAAEAgqoJAAAAAwEOqgkAAAABAWKqCQAAAAGBErIJAAAAAIOetgkAAAABAMLCCQAAAAGDlsYJAAAAA4DeygkAAAACAw7SCQAAAACBauIJAAAAAwFC6gkAAAACAYb6CQAAAAIDnwYJAAAAAwDjHgkAAAAAATMyCQAAAAADnzIJAAAAAAArOgkAAAACgTc6CQAAAAKBqzoJAAAAAQJLPgkAAAACAVNCCQAAAAAAK0oJAAAAAIF7SgkAAAADABNSCQAAAAKAx1YJAAAAAYHXXgkAAAAAg3teCQAAAAOAf2oJAAAAAICbcgkAAAAAAx92CQAAAAKA03oJAAAAAIBXigkAAAABAO+KCQAAAAKCr5YJAAAAAYMnlgkAAAADgK+aCQAAAAKDq6YJAAAAAAC3qgkAAAACgNe6CQAAAAMCA74JAAAAAANL0gkAAAAAAUfiCQAAAAOCt+YJAAAAA4O75gkAAAACAPPqCQAAAAGBq/oJAAAAAwEQCg0AAAABAKAqDQAAAAAAoDoNAAAAA4G0Og0AAAADAZhKDQAAAACCBEoNAAAAAwIQWg0AAAABAjheDQAAAAMA+GoNAAAAAQIYag0AAAAAA7RyDQAAAAEB9HYNAAAAAwIseg0AAAADgbyKDQAAAAAAPJINAAAAAgEUlg0AAAACgYCaDQAAAAOBfKoNAAAAA4DAug0AAAADAZi+DQAAAAAA0MINAAAAAoFMwg0AAAAAgMjKDQAAAACA2NINAAAAAgDk2g0AAAAAg+DyDQAAAAOAfPoNAAAAAAKU/g0AAAAAgQUKDQAAAAKBVQoNAAAAAYNRHg0AAAAAgf0qDQAAAAICZSoNAAAAAIDFQg0AAAAAgM1KDQAAAAEBjUoNAAAAAoGNUg0AAAAAAZ1aDQAAAAABqWINAAAAAYHhZg0AAAACAfl2DQAAAAEAcXoNAAAAAICpfg0AAAADgw2GDQAAAAACmZ4NAAAAAQDRog0AAAABAiWqDQAAAAAAibYNAAAAAQHpug0AAAABAEnKDQAAAAABCcoNAAAAAoHxyg0AAAACgGXaDQAAAAMBDdoNAAAAAgJR2g0AAAAAgHHqDQAAAAEBGeoNAAAAAIHh6g0AAAABAl3qDQAAAAKCkfYNAAAAAYE5+g0AAAAAAm36DQAAAAEDigYNAAAAAQGWCg0AAAABAjYKDQAAAACClgoNAAAAA4ESGg0AAAABgYIaDQAAAAMDSh4NAAAAAYCqIg0AAAADAT4qDQAAAAACCioNAAAAAwJmOg0AAAABAnZKDQAAAAKDJlINAAAAAgHGWg0AAAABgl5eDQAAAAEB4mINAAAAA4B2eg0AAAADgKKCDQAAAAGAvooNAAAAAwJakg0AAAABAUaiDQAAAAGBgqoNAAAAAIBatg0AAAADAxa6DQAAAAKCosoNAAAAAICy6g0AAAAAAbrqDQAAAAACRuoNAAAAAAMa/g0AAAACgOsCDQAAAAABzwoNAAAAAIOTCg0AAAACADcWDQAAAAGB4xoNAAAAAQOfGg0AAAAAAvceDQAAAAEBZyoNAAAAA4OnKg0AAAAAgMs6DQAAAAOC/z4NAAAAAYInUg0AAAABgftaDQAAAAODC1oNAAAAAwJXXg0AAAAAAgdiDQAAAAKA/2oNAAAAAoILag0AAAABghtyDQAAAAGCm3INAAAAAAO7cg0AAAACgr9+DQAAAAGCY4oNAAAAAgDbkg0AAAABAN+WDQAAAAKBU5oNA</binary>
                </binaryDataArray>

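1 answer:

Answer 0 (score: 0)

I think you're close. Assuming you can use the REXML library (it looks like it is part of the core Ruby library), you should be able to do this: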

require 'rexml/document'

xmlfile = File.new("afile.mzML")
xmldoc = REXML::Document.new(xmlfile)
root = xmldoc.root

start_time = nil
binary = nil
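# NOTE: these paths are relative to root and assume <run> is the root element,
# as in the sample above. If the root is <mzML>, prefix each path with "run/".
# get the ms level
ms_level = root.elements["spectrumList/spectrum/cvParam[@name='ms level']"].attributes["value"].to_i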

if ms_level == 1
  # get the scan start time
  start_time = root.elements["spectrumList/spectrum/scanList/scan/cvParam[@name='scan start time']"].attributes["value"]
  # get the binary
  binary = root.elements["spectrumList/spectrum/binaryDataArrayList/binaryDataArray/binary"].text
end

p start_time # => "5429.47"
p binary # => that crazy long binary
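
The snippet above reads only the first spectrum in the file, because `elements[...]` returns the first match. Since the sample lists 3310 spectra, here is a minimal sketch of looping over all of them. The helper name `ms1_spectra` is hypothetical, and it assumes every spectrum follows the layout of the sample; missing elements come back as nil instead of raising.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): returns [start_time, binary]
# pairs for every spectrum whose ms level is 1.
def ms1_spectra(xml)
  doc = REXML::Document.new(xml)
  results = []
  # "//spectrum" matches spectra at any depth, so the root element does not matter.
  doc.elements.each("//spectrum") do |spectrum|
    level = spectrum.elements["cvParam[@name='ms level']"]
    next unless level && level.attributes["value"].to_i == 1

    time = spectrum.elements["scanList/scan/cvParam[@name='scan start time']"]
    # Only the first binaryDataArray of the spectrum is read here.
    binary = spectrum.elements["binaryDataArrayList/binaryDataArray/binary"]
    results << [time && time.attributes["value"], binary && binary.text]
  end
  results
end
```

It could be called as `ms1_spectra(File.new("afile.mzML"))`, since `REXML::Document.new` accepts either a string or an IO object. REXML builds the whole document tree in memory, so this may be slow on a large file.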

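This REXML tutorial is very helpful: http://www.germane-software.com/software/rexml/docs/tutorial.html

Note that I made some assumptions: that the elements always exist, that the ms level is always an int, and that the file structure is always the same. Those assumptions may not hold in your case, but this should be a start.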
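
As for the regex attempt in the question: `String#scan` with a capture group returns an array of arrays, so `scn == 1` compares an array with an integer and is never true. `puts` flattens arrays, which is why it prints a bare 1. If you only need the first match on a line, `String#[]` with a regex and a group index returns the captured string, or nil when nothing matches. A small sketch, where the helper name `ms_level` is hypothetical:

```ruby
# Hypothetical helper: returns the ms level found on a line as an Integer,
# or nil when the line has no ms level attribute.
def ms_level(line)
  value = line[/name="ms level" value="(\d+)"/, 1] # capture group 1, or nil
  value && value.to_i
end

line = '<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1" />'

p line.scan(/ms level" value="(\d)/) # => [["1"]]
puts "Got it!" if ms_level(line) == 1
```

This keeps the line-by-line approach, but it only sees one line at a time, so it cannot tell which spectrum a later line belongs to. That is why the REXML version is probably the better fit for the start time and binary.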