Accessing the actual content of an XML file in R?

Time: 2017-07-12 17:07:01

Tags: r xml

I am working with a well-structured XML file that begins with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="5.0" exported-on="2017-07-06">
<drug type="biotech" created="2005-06-13" updated="2016-08-17">
  <drugbank-id primary="true">DB00001</drugbank-id>
  <drugbank-id>BTD00024</drugbank-id>
  <drugbank-id>BIOD00024</drugbank-id>
  <name>Lepirudin</name>
  <description>Lepirudin is identical to natural hirudin except for substitution of leucine for isoleucine at the N-terminal end of the molecule and the absence of a sulfate group on the tyrosine at position 63. It is produced via yeast cells. Bayer ceased the production of lepirudin (Refludan) effective May 31, 2012.</description>
  <cas-number>138068-37-8</cas-number>
  <unii>Y43GF64R34</unii>
  <state>liquid</state>
  <groups>
     <group>approved</group>
  </groups>

...

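The file consists of many nodes, each representing one drug. My goal is to extract two specific fields from every node in this file: name and drugbank-id primary="true"

...and to save these to a neatly formatted table (one column for name, a second column for drugbank-id).

I have read many tutorials and have succeeded in accessing the higher-level structure of the XML tree, but where the examples provide syntax for accessing actual values (such as a specific drug name), that code does not work for me.

This is my current code: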

library(XML)

# Save the database file as a tree structure
xmldata = xmlRoot(xmlTreeParse("DrugBank_TruncatedDatabase_v3_Small.xml"))

# Number of nodes in the entire database file
NumNodes <- xmlSize(xmldata)

# Create array structure to hold DrugBank ID values
DB_ID <- array(1:NumNodes, dim=c(1,NumNodes,1))

# Create array structure to hold Drug Name values
DrugName <- array(1:NumNodes, dim=c(1,NumNodes,1))

# for each node (i.e. each drug) in the database
for (i in 1:NumNodes){

    # Assign the Drug Names to easy-to-comprehend DrugName array
    DrugName[i] <- xmldata[[i]][["name"]]

    # Assign the DrugBank ID numbers to easy-to-comprehend DB_ID array
    DB_ID[i] <- xmldata[[i]][["drugbank-id"]]
}

EdgeListTable = data.frame(DrugName, DB_ID)

write.table(EdgeListTable, file="Output1.txt", quote=F)

The output file contains the following text, which is one level higher than what I want:
X.name. X.name..1 X.name..2 X.name..3 X.drugbank.id. X.drugbank.id..1 X.drugbank.id..2 X.drugbank.id..3
1 name name name name drugbank-id drugbank-id drugbank-id drugbank-id
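
A likely explanation, as a minimal sketch: `xmldata[[i]][["name"]]` returns the `<name>` node itself rather than its text, so element names end up in the table instead of values. Wrapping the node in `xmlValue()` yields the text. The inline two-drug document below is a stand-in for the real file (an assumption for illustration) and omits the namespace declaration.

```r
library(XML)

# Illustrative stand-in for the real file (assumption: same layout, no namespace)
xml_text <- '<drugbank>
  <drug><drugbank-id primary="true">DB00001</drugbank-id><drugbank-id>BTD00024</drugbank-id><name>Lepirudin</name></drug>
  <drug><drugbank-id primary="true">DB00002</drugbank-id><name>Cetuximab</name></drug>
</drugbank>'

xmldata <- xmlRoot(xmlTreeParse(xml_text, asText = TRUE))

first_drug <- xmldata[[1]]
node  <- first_drug[["name"]]   # an XMLNode object, not a string
value <- xmlValue(node)         # the text inside <name>...</name>

print(class(node))
print(value)
```

The same wrapping applies to `xmldata[[i]][["drugbank-id"]]`, which picks the first `drugbank-id` child of the drug.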

If I try:     xmlSApply(xmldata, function(x) xmlSApply(x, xmlValue))

...my output looks like this:

$drug
$drug$drugbank-id
[1] "DB00001"

$drug$drugbank-id
[1] "BTD00024"

$drug$drugbank-id
[1] "BIOD00024"

$drug$name
[1] "Lepirudin"
...

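...but after experimenting, I am not sure how to actually access the values I need.

I would appreciate advice on the best way to store the values from the two fields of interest as a table.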

============================================================

Update: I was able to extract the desired values using the following code:

DrugBankData <- xmlSApply(xmldata, function(x) xmlSApply(x, xmlValue))

for (i in 1:NumNodes){
   DB_ID[i] <- DrugBankData[[i]][[1]]
   DrugName[i] <- DrugBankData[[i]][[4]]
}

EdgeListTable = data.frame(DrugName, DB_ID)
write.table(EdgeListTable, file="Output1.txt", quote=F)

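The output file looks like this:
X1 X2 X3 X4 X1.1 X2.1 X3.1 X4.1
1 Lepirudin Cetuximab Dornase alfa Denileukin diftitox DB00001 DB00002 DB00003 DB00004

So I am struggling to get this formatted properly into columns and to remove the first line of text from the file, as well as the "1" at the start of the second line...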
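
One way to get one row per drug, sketched under the assumption that the values have already been extracted as strings: `array(..., dim=c(1,NumNodes,1))` has a single row, so `data.frame()` spreads it across `NumNodes` columns. Plain character vectors give one row per drug, and `row.names = FALSE` drops the leading "1". The values below are copied from the output above.

```r
# Values as they appear in the output above
DrugName <- c("Lepirudin", "Cetuximab", "Dornase alfa", "Denileukin diftitox")
DB_ID    <- c("DB00001", "DB00002", "DB00003", "DB00004")

# Two vectors of equal length become two columns, one row per drug
EdgeListTable <- data.frame(DrugName = DrugName, DB_ID = DB_ID,
                            stringsAsFactors = FALSE)

# Tab-separated, no row numbers; the header row holds the column names
write.table(EdgeListTable, file = "Output1.txt", sep = "\t",
            row.names = FALSE, quote = FALSE)
```

Inside the loop, this would mean starting from `character(NumNodes)` rather than an array and filling `DrugName[i]` and `DB_ID[i]` as before.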

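2 Answers:

Answer 0: (Score: 0)

Thanks for your reply, 牧民. I ended up solving the formatting problem (mostly, apart from the columns still not lining up...) with the following code: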

DrugName_Matrix = matrix(DrugName,nrow=NumNodes,ncol=1) 
DrugID_Matrix = matrix(DB_ID,nrow=NumNodes,ncol=1) 
Composite_Matrix = cbind(DrugName_Matrix,DrugID_Matrix,Target)
write.table(Composite_Matrix, file="Output1.txt", sep='\t', row.names=F, quote=F)

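There are still mysterious column header names ("V1", "V2") that do not appear in the contents of either matrix; my attempts to rename them using the standard approaches were unsuccessful, e.g.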

colnames(Composite_Matrix)[colnames(Composite_Matrix)=="V1"] <- "Drug Name"
colnames(Composite_Matrix)[colnames(Composite_Matrix)=="V2"] <- "Drug ID"

setnames(Composite_Matrix, old=c("V1","V2"), new=c("DrugName", "DrugID"))

I am not sure where these V column headers come from...

As requested, the contents of the two matrices of interest are:

> DrugName_Matrix
     [,1]                 
[1,] "Lepirudin"          
[2,] "Cetuximab"          
[3,] "Dornase alfa"       
[4,] "Denileukin diftitox"

> DrugID_Matrix
     [,1]     
[1,] "DB00001"
[2,] "DB00002"
[3,] "DB00003"
[4,] "DB00004"

...and the output table is:

V1  V2
Lepirudin   DB00001
Cetuximab   DB00002
Dornase alfa    DB00003
Denileukin diftitox DB00004
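
A possible explanation for the V1/V2 headers, as a sketch: the matrices carry no column names of their own, so there is nothing for the rename calls to match, and `write.table()` fills in default header names (`V1`, `V2`, ...) when a matrix has none. Assigning `colnames()` directly before writing should replace them. `Target` is left out below because its contents are not shown, and only two drugs are used to keep the example short.

```r
DrugName_Matrix  <- matrix(c("Lepirudin", "Cetuximab"), ncol = 1)
DrugID_Matrix    <- matrix(c("DB00001", "DB00002"), ncol = 1)
Composite_Matrix <- cbind(DrugName_Matrix, DrugID_Matrix)

# No column names exist yet, so nothing equals "V1" or "V2"
print(is.null(colnames(Composite_Matrix)))

# Name the columns directly instead of matching on "V1"/"V2"
colnames(Composite_Matrix) <- c("DrugName", "DrugID")

write.table(Composite_Matrix, file = "Output1.txt", sep = "\t",
            row.names = FALSE, quote = FALSE)
```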

Answer 1: (Score: 0)

To read DrugBank nodes, I created the following function:

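library(XML)     # xmlValue, xmlToDataFrame, xmlChildren, xmlGetAttr
library(purrr)   # map_df
library(tibble)  # tibble

drug_sub_df <- function(rec, main_node, seconadary_node = NULL, id = "drugbank-id", byValue = FALSE) {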
    parent_key <- NULL
    if (!is.null(id)) {
        parent_key <- xmlValue(rec[id][[1]])
    }

    if (byValue) {
        df <- map_df(rec[main_node], xmlValue)
    } else {
        if (is.null(seconadary_node) && !is.null(rec[[main_node]])) {
            df <- xmlToDataFrame(rec[[main_node]], stringsAsFactors = FALSE)
        } else {
            df <- xmlToDataFrame(rec[[main_node]][[seconadary_node]], stringsAsFactors = FALSE)
        }

    }

    if (nrow(df) > 0 && !is.null(parent_key)) {
        df$parent_key <- parent_key
    }
    return(df)
}

I then call the function as follows:

# Extract drug enzymes actions df
get_enzymes_actions_df <- function(rec) {
  return(map_df(xmlChildren(rec[["enzymes"]]),
                ~ drug_sub_df(.x, "actions", id = "id")))
}

# Extract drug articles df
get_enzymes_articles_df <- function(rec) {
  return(map_df(
    xmlChildren(rec[["enzymes"]]),
    ~ drug_sub_df(.x, "references", seconadary_node = "articles", id = "id")
  ))
}

Of course, different cases require different solutions, for example:

get_enzyme_rec <- function(r, drug_key) {
  tibble(
    id = xmlValue(r[["id"]]),
    name = xmlValue(r[["name"]]),
    organism = xmlValue(r[["organism"]]),
    known_action = xmlValue(r[["known-action"]]),
    inhibition_strength = xmlValue(r[["inhibition-strength"]]),
    induction_strength = xmlValue(r[["induction-strength"]]),
    position = ifelse(is.null(xmlGetAttr(r, name = "position")),
                      NA, xmlGetAttr(r, name = "position")),
    parent_key = drug_key
  )
}

get_enzymes_df <- function(rec) {
  return(map_df(xmlChildren(rec[["enzymes"]]),
                ~ get_enzyme_rec(.x, xmlValue(rec["drugbank-id"][[1]]))))
}

or this one:

get_atc_codes_rec <- function(r, drug_key) {
  tibble(
    atc_code = xmlGetAttr(r, name = "code"),
    level_1 = xmlValue(r[[1]]),
    code_1 = xmlGetAttr(r[[1]], name = "code"),
    level_2 = xmlValue(r[[2]]),
    code_2 = xmlGetAttr(r[[2]], name = "code"),
    level_3 = xmlValue(r[[3]]),
    code_3 = xmlGetAttr(r[[3]], name = "code"),
    level_4 = xmlValue(r[[4]]),
    code_4 = xmlGetAttr(r[[4]], name = "code"),
    parent_key = drug_key
  )
}

get_atc_codes_df <- function(rec) {
  return (map_df(xmlChildren(rec[["atc-codes"]]),
                 ~ get_atc_codes_rec(.x,
                                     xmlValue(rec["drugbank-id"][[1]]))))
}

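In this package you can find more examples of extracting the contents of the DrugBank XML database, with its different structures, in R: https://github.com/Dainanahan/dbparser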
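
For the two fields in the original question, a shorter route may be XPath on the parsed document. This is only a sketch: it uses the `XML` package's `xmlParse()` and `xpathSApply()`, binds the document's default namespace to an arbitrary prefix `db` (my choice, not something from the file), and uses a cut-down inline document in place of the real file. It also assumes every drug has exactly one name and one primary ID; otherwise the two vectors would not line up.

```r
library(XML)

# Cut-down stand-in for the real file (assumption: same element layout)
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca">
  <drug type="biotech">
    <drugbank-id primary="true">DB00001</drugbank-id>
    <drugbank-id>BTD00024</drugbank-id>
    <name>Lepirudin</name>
  </drug>
  <drug type="biotech">
    <drugbank-id primary="true">DB00002</drugbank-id>
    <name>Cetuximab</name>
  </drug>
</drugbank>'

doc <- xmlParse(xml_text, asText = TRUE)

# XPath needs a prefix for elements in a default namespace
ns <- c(db = "http://www.drugbank.ca")

# Absolute paths, so only top-level <drug> elements are matched
drug_name <- xpathSApply(doc, "/db:drugbank/db:drug/db:name",
                         xmlValue, namespaces = ns)
drug_id <- xpathSApply(doc,
                       "/db:drugbank/db:drug/db:drugbank-id[@primary='true']",
                       xmlValue, namespaces = ns)

result <- data.frame(name = drug_name, drugbank_id = drug_id,
                     stringsAsFactors = FALSE)
print(result)
```

For the real file, `xmlParse("DrugBank_TruncatedDatabase_v3_Small.xml")` would take the place of the inline text.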