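I have a data set that can be downloaded from http://mips.helmholtz-muenchen.de/proj/ppi/ (at the bottom of the page, where it says "You can get the complete dataset").
I then tried to read it with the XML package: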
library(XML)
doc <- xmlTreeParse("path to/allppis.xml", useInternal = TRUE)
root <- xmlRoot(doc)
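but the result seems to be empty.
What do I want?
In the allppis.xml file downloaded from that site, I want to extract specific lines into a txt file: the ones that start with <fullName> and end with </fullName>.
For example, when I open the file I can see this: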
<fullName>S100A8;CAGA;MRP8; calgranulin A (migration inhibitory factor-related protein 8)</fullName>
and I want to get this:
Proteins description
S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8)
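Answer 0 (score: 2)
I think you want something like this (IMO the question is not very clear). I also think the main problem is the default namespace, which is definitely a royal pain: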
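library(xml2)
library(purrr)
library(dplyr)
library(stringi)

doc <- read_xml("allppis.xml")

# xml2 names the file's default namespace "d1"; rename it to "x" so it can be used as an XPath prefix
ns <- xml_ns_rename(xml_ns(doc), d1="x")

xml_find_all(doc, ".//x:proteinInteractor/x:names/x:fullName", ns) %>%
  xml_text() %>%
  # split each name at the first "; " only, giving a two-column character matrix
  stri_split_fixed("; ", n=2, simplify=TRUE) %>%
  # as_data_frame() is deprecated in newer tibble releases in favour of as_tibble()
  as_data_frame() %>%
  setNames(c("Proteins", "Description")) %>%
  mutate(Proteins=trimws(Proteins),
         Description=trimws(Description))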
## # A tibble: 3,628 × 2
## Proteins Description
## <chr> <chr>
## 1 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8)
## 2 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 3 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 4 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8)
## 5 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8)
## 6 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 7 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 8 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8)
## 9 TRP3 calcium influx channel protein
## 10 IP3R-3 inositol 1,4,5-trisphosphate receptor, type 3
## # ... with 3,618 more rows
You'll need to clean it up a bit (View() the resulting data frame to see what I mean).
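To make the namespace point concrete and to cover the "txt file" part of the question, here is a minimal, self-contained sketch. It assumes a tiny inline XML document instead of the real allppis.xml. The namespace URI `urn:example:ppi`, the `entrySet` root element and the `tempfile()` output path are made up for illustration, and the second `<fullName>` value is reconstructed from the printed output above. The splitting uses base R `sub()` rather than stringi, so it is a swapped-in variant and not the exact pipeline above.

```r
library(xml2)

# Hypothetical miniature of the file: the namespace URI and the root element
# are invented for illustration; the inner element names follow the XPath above.
xml_txt <- '<entrySet xmlns="urn:example:ppi">
  <proteinInteractor><names>
    <fullName>S100A8;CAGA;MRP8; calgranulin A (migration inhibitory factor-related protein 8)</fullName>
  </names></proteinInteractor>
  <proteinInteractor><names>
    <fullName>S100A9;CAGB;MRP14; calgranulin B (migration inhibitory factor-related protein 14)</fullName>
  </names></proteinInteractor>
</entrySet>'
doc <- read_xml(xml_txt)

# An unprefixed XPath name only matches elements in no namespace, so this finds nothing.
n_unprefixed <- length(xml_find_all(doc, ".//fullName"))

# Bind the default namespace (named "d1" by xml2) to the prefix "x" and query with it.
ns <- xml_ns_rename(xml_ns(doc), d1 = "x")
full_names <- xml_text(xml_find_all(doc, ".//x:fullName", ns))

# Split each name at the first semicolon that is followed by whitespace.
result <- data.frame(
  Proteins    = sub(";\\s+.*$", "", full_names),
  Description = sub("^.*?;\\s+", "", full_names, perl = TRUE),
  stringsAsFactors = FALSE
)

# Write a tab-separated text file; tempfile() stands in for a real output path.
out_file <- tempfile(fileext = ".txt")
write.table(result, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Names that contain no "; " separator would end up unchanged in both columns with this approach, so that case still needs separate handling.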