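<-- Updated for completeness (thanks to hrbrmstr for pointing this out) -->
I am trying to extract some data from PubMed, and I have been reading through the examples here (related diagram here). An edited version of my data looks like this: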
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">11841882</PMID>
<Article PubModel="Print">
<PublicationTypeList>
<PublicationType UI="D002363">Case Reports</PublicationType>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D006323">Heart Arrest</DescriptorName>
<QualifierName MajorTopicYN="Y" UI="Q000188">drug therapy</QualifierName>
<QualifierName MajorTopicYN="N" UI="Q000401">mortality</QualifierName>
<QualifierName MajorTopicYN="N" UI="Q000628">therapy</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">11841881</PMID>
<Article PubModel="Print">
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D000368">Aged</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>
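So far, I have been able to extract the PublicationTypes without trouble using the following code (please first run the top snippet of the code at the end of this post):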
utilAtype <- function(x){
PMID <- xmlValue(x[[1]][[1]])
PublicationType <- sapply(xmlChildren(x[["Article"]][["PublicationTypeList"]], omitNodeTypes = "XMLInternalTextNode"), xmlValue)
data.frame(PMID = PMID, PublicationType=PublicationType, stringsAsFactors = FALSE)
}
PMIDAType <- xpathApply(hdisease, '//MedlineCitation', utilAtype)
PMIDAType <- do.call(rbind, PMIDAType)
PMID PublicationType
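11841882 Case Reports
11841882 Journal Article
11841881 Journal Article
However, using a similar approach on the MeshHeadings causes the remaining child nodes to be skipped, as shown below: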
PMID LName
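11841882 Cardiopulmonary Resuscitation
- other entries for 11841882 missing -
11841881 Aged
I would appreciate it if anyone could enlighten me on this. The way it is presented in the examples suggests this approach should work without problems. Please see the code below for reference.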
require("XML")
xmlfile=xmlParse("file.xml", useInternalNodes = TRUE)
hdisease = xmlRoot(xmlfile)
utilMesh <- function(x){
PMID <- xmlValue(x[[1]][[1]])
MHead <- ifelse(is.null(x[["MeshHeadingList"]]), NA,
sapply(xmlChildren(x[["MeshHeadingList"]], omitNodeTypes = "XMLInternalTextNode"), function(z) xmlValue(z[["DescriptorName"]])))
data.frame(PMID = PMID, MHead=MHead, stringsAsFactors = FALSE)
}
PMIDMesh <- xpathApply(hdisease, '//MedlineCitation', utilMesh)
PMIDMesh <- do.call(rbind, PMIDMesh)
c<-nrow(PMIDMesh)
row.names(PMIDMesh) <- 1:c
nrow(table(PMIDMesh))
write.csv(PMIDMesh,"Mesh1.csv")
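Answer 0 (score: 0)
I would use XPath instead, maybe...
library(XML)      # provides xmlParse, getNodeSet and xpathSApply
library(rentrez)  # provides entrez_fetch
x <- entrez_fetch("pubmed", rettype = "xml", id = c(11841882, 11841881))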
doc <- xmlParse(x)
pubs <- getNodeSet(doc, "//PubmedArticle")
y <- lapply(pubs, function(x) data.frame(
pmid = xpathSApply(x, ".//MedlineCitation/PMID", xmlValue),
mesh = xpathSApply(x, ".//MeshHeading/DescriptorName", xmlValue)) )
do.call("rbind", y)
pmid mesh
1 11841882 Cardiopulmonary Resuscitation
2 11841882 Child, Preschool
3 11841882 Female
4 11841882 Heart Arrest
5 11841882 Humans
6 11841882 Infant
7 11841882 Male
8 11841882 Retrospective Studies
9 11841882 Time Factors
10 11841882 Vasoconstrictor Agents
11 11841882 Vasopressins
12 11841881 Aged
13 11841881 Cardiopulmonary Resuscitation
14 11841881 Electric Countershock
15 11841881 Family Practice
...
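As for why utilMesh in the question keeps only one heading per PMID: the likely culprit is ifelse(). It returns a result with the same length as its test, and is.null(...) is a single TRUE or FALSE, so only the first MeSH term survives. A plain if ... else keeps the whole vector. The sketch below uses base R only; has_list and mesh_terms are made-up stand-ins for the is.null() check and the sapply() result in the question, not names from the original code.

```r
# Stand-ins for the values computed inside utilMesh (illustrative names only)
mesh_terms <- c("Cardiopulmonary Resuscitation", "Heart Arrest")
has_list   <- TRUE  # plays the role of !is.null(x[["MeshHeadingList"]])

# ifelse() is vectorised over its test; the test has length 1,
# so the result has length 1 and the second term is dropped
truncated <- ifelse(!has_list, NA, mesh_terms)

# if ... else returns whichever branch is chosen, unchanged
full <- if (!has_list) NA else mesh_terms

# data.frame() recycles the single PMID across all headings
df <- data.frame(PMID = "11841882", MHead = full, stringsAsFactors = FALSE)
print(length(truncated))
print(df)
```

Swapping the ifelse(...) in utilMesh for the if ... else form should therefore give one row per MeshHeading, although the XPath version above is shorter.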