XPath in R: selecting values

Time: 2015-03-19 10:24:39

Tags: xml r xpath bioinformatics

I have an XML file that looks like this:

<?xml version="1.0"?>
<!DOCTYPE pathway SYSTEM "http://www.kegg.jp/kegg/xml/KGML_v0.7.1_.dtd">
<!-- Creation date: Sep 1, 2014 12:00:13 +0900 (GMT+09:00) -->
<pathway name="path:hsa04010" org="hsa" number="04010"
         title="MAPK signaling pathway"
         image="http://www.kegg.jp/kegg/pathway/hsa/hsa04010.png"
         link="http://www.kegg.jp/kegg-bin/show_pathway?hsa04010">
    <entry id="1" name="cpd:C00338" type="compound"
        link="http://www.kegg.jp/dbget-bin/www_bget?C00338">
        <graphics name="C00338" fgcolor="#000000" bgcolor="#FFFFFF"
             type="circle" x="138" y="743" width="8" height="8"/>
    </entry>
    <entry id="2" name="hsa:5923 hsa:5924" type="gene"
        link="http://www.kegg.jp/dbget-bin/www_bget?hsa:5923+hsa:5924">
        <graphics name="RASGRF1, CDC25, CDC25L, GNRP, GRF1, GRF55, H-GRF55, PP13187, ras-GRF1..." fgcolor="#000000" bgcolor="#BFFFBF"
             type="rectangle" x="392" y="236" width="46" height="17"/>
    <relation entry1="47" entry2="40" type="PPrel">
        <subtype name="activation" value="--&gt;"/>
    </relation>
    <relation entry1="46" entry2="40" type="PPrel">
        <subtype name="activation" value="--&gt;"/>
    </relation>
    <relation entry1="45" entry2="40" type="PPrel">
        <subtype name="activation" value="--&gt;"/>
    </relation>

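What I want to do is:

  1. Extract the id and name attributes of all entry children that have type="gene", and store them in a list/dictionary/data frame for later use.
  2. Extract all attributes of the relation children and store them in a similar structure.

I have only just started with XML parsing. I have been reading other questions on Stack Overflow and various FAQs around the web, but I can't seem to get it to work. I can do the following and select all the nodes for (1) above:

    data = xmlTreeParse('~/Downloads/hsa04010.xml')
    root = xmlRoot(data)
    getNodeSet(root, '/pathway/entry[@type="gene"]')

...which works fine, but I don't know how to get the two individual values (or, in the second case, all of the values) and store them somewhere. I tried

    getNodeSet(root, '/pathway/entry[@type="gene"]/@id')

...but this only gives me an error:

    Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘saveXML’ for signature ‘"XMLAttributeValue"’

Even if it worked, I would only get the id attribute, rather than both id and name, which is what I want. But seeing as I can't even get a single attribute value, well...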

2 answers:

Answer 0 (score: 1)

The KEGGgraph package has a KGML parser that may help. Check the vignette for details.

library(KEGGgraph)
url <- "http://rest.kegg.jp/get/hsa04010/kgml"
x <- parseKGML(url)

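You could also parse the URL directly and then use the different XPath queries suggested here, or something like xmlAttrsToDataFrame, which is explained in the new book on XML for data science in R.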

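library(XML)

# Parse the KGML straight from the URL into an internal document
doc <- xmlParse(url)
# xmlAttrsToDataFrame is not exported from XML, hence the ':::'
genes <- XML:::xmlAttrsToDataFrame(doc["//entry[@type='gene']"])

relations <- XML:::xmlAttrsToDataFrame(doc["//relation"])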
relations
    entry1 entry2  type
1       47     40 PPrel
2       46     40 PPrel
3       45     40 PPrel
4       44     39 PPrel
5       43     38 PPrel
...
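
If you would rather not depend on an unexported function, the id and name attributes of the gene entries can probably be collected with exported functions only. The following is a minimal sketch, not a tested recipe for the full file: it assumes the XML package is installed, and the inline `kgml` string is a cut-down excerpt of the question's file that stands in for the real download (the names `kgml`, `gene_path` and `genes` are made up for illustration).

```r
library(XML)

# Cut-down excerpt of the KGML from the question (assumption: the real
# file has the same layout, with entry elements directly under pathway)
kgml <- '<pathway name="path:hsa04010" org="hsa" number="04010">
  <entry id="1" name="cpd:C00338" type="compound"/>
  <entry id="2" name="hsa:5923 hsa:5924" type="gene"/>
  <relation entry1="47" entry2="40" type="PPrel"/>
</pathway>'

# xmlParse builds an internal (C-level) document, which XPath queries run on
doc <- xmlParse(kgml, asText = TRUE)

# Select the gene entries once, then pull one attribute per column
gene_path <- '/pathway/entry[@type="gene"]'
genes <- data.frame(
  id   = xpathSApply(doc, gene_path, xmlGetAttr, "id"),
  name = xpathSApply(doc, gene_path, xmlGetAttr, "name"),
  stringsAsFactors = FALSE
)
print(genes)
```

Selecting the element nodes and reading each attribute with xmlGetAttr keeps id and name aligned row by row, which is harder to guarantee when attribute nodes are selected directly with /@id.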

Answer 1 (score: 0)

You could try

lapply(data['/pathway/entry[@type="gene"]/@id | /pathway/entry[@type="gene"]/*//@name'], as, "character")
# [[1]]
# [1] "2"
# 
# [[2]]
# [1] "RASGRF1, CDC25, CDC25L, GNRP, GRF1, GRF55, H-GRF55, PP13187, ras-GRF1..."
# 
# [[3]]
# [1] "activation"
# 
# [[4]]
# [1] "activation"
# 
# [[5]]
# [1] "activation"

xpathApply(data, '/pathway/entry[@type="gene"]//relation', xmlAttrs)
# [[1]]
# entry1  entry2    type 
# "47"    "40" "PPrel" 
# 
# [[2]]
# entry1  entry2    type 
# "46"    "40" "PPrel" 
# 
# [[3]]
# entry1  entry2    type 
# "45"    "40" "PPrel" 

Edit:

data

data <-  xmlParse('<?xml version="1.0"?>
  <!DOCTYPE pathway SYSTEM "http://www.kegg.jp/kegg/xml/KGML_v0.7.1_.dtd">
  <!-- Creation date: Sep 1, 2014 12:00:13 +0900 (GMT+09:00) -->
  <pathway name="path:hsa04010" org="hsa" number="04010"
title="MAPK signaling pathway"
image="http://www.kegg.jp/kegg/pathway/hsa/hsa04010.png"
link="http://www.kegg.jp/kegg-bin/show_pathway?hsa04010">
  <entry id="1" name="cpd:C00338" type="compound"
link="http://www.kegg.jp/dbget-bin/www_bget?C00338">
  <graphics name="C00338" fgcolor="#000000" bgcolor="#FFFFFF"
type="circle" x="138" y="743" width="8" height="8"/>
  </entry>
  <entry id="2" name="hsa:5923 hsa:5924" type="gene"
link="http://www.kegg.jp/dbget-bin/www_bget?hsa:5923+hsa:5924">
  <graphics name="RASGRF1, CDC25, CDC25L, GNRP, GRF1, GRF55, H-GRF55, PP13187, ras-GRF1..." fgcolor="#000000" bgcolor="#BFFFBF"
type="rectangle" x="392" y="236" width="46" height="17"/>
  <relation entry1="47" entry2="40" type="PPrel">
  <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="46" entry2="40" type="PPrel">
  <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="45" entry2="40" type="PPrel">
  <subtype name="activation" value="--&gt;"/>
  </relation>
</entry>
</pathway>', asText = TRUE)
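
To get from the list of attribute vectors that xpathApply returns to the "similar structure" asked for in the question, the vectors can be bound row-wise. This is only a sketch under a few assumptions: the XML package is installed, every relation element carries the same set of attributes (as in the sample), and the inline `kgml` excerpt and the names `rel_attrs` and `relations` are illustrative rather than part of the original answer.

```r
library(XML)

# The three relation elements from the sample, wrapped in a minimal pathway
kgml <- '<pathway name="path:hsa04010" org="hsa" number="04010">
  <relation entry1="47" entry2="40" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="46" entry2="40" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="45" entry2="40" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
</pathway>'

doc <- xmlParse(kgml, asText = TRUE)

# One named character vector (entry1, entry2, type) per relation element
rel_attrs <- xpathApply(doc, "//relation", xmlAttrs)

# Stack the vectors into a character matrix, then convert to a data frame;
# this relies on all relations having the same attributes in the same order
relations <- as.data.frame(do.call(rbind, rel_attrs),
                           stringsAsFactors = FALSE)
print(relations)
```

All columns come back as character, so entry1 and entry2 would need as.integer() if they are to be used as numeric ids.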