Parsing XML stops after the first instance

Asked: 2015-06-10 11:37:17

Tags: xml r sapply

<-- Updated for completeness (thanks to hrbrmstr for pointing this out) -->

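I am trying to extract some data from PubMed, and I have been following the example here (related figure here). An edited version of my data looks like this: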

<PubmedArticleSet>
   <PubmedArticle>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
         <PMID Version="1">11841882</PMID>
         <Article PubModel="Print">
            <PublicationTypeList>
               <PublicationType UI="D002363">Case Reports</PublicationType>
               <PublicationType UI="D016428">Journal Article</PublicationType>
            </PublicationTypeList>
         </Article>
         <MeshHeadingList>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
            </MeshHeading>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D006323">Heart Arrest</DescriptorName>
               <QualifierName MajorTopicYN="Y" UI="Q000188">drug therapy</QualifierName>
               <QualifierName MajorTopicYN="N" UI="Q000401">mortality</QualifierName>
               <QualifierName MajorTopicYN="N" UI="Q000628">therapy</QualifierName>
            </MeshHeading>
         </MeshHeadingList>
      </MedlineCitation>       
   </PubmedArticle>

   <PubmedArticle>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
         <PMID Version="1">11841881</PMID>
         <Article PubModel="Print">
            <PublicationTypeList>
               <PublicationType UI="D016428">Journal Article</PublicationType>
            </PublicationTypeList>
         </Article>
      <MeshHeadingList>
           <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D000368">Aged</DescriptorName>
           </MeshHeading>
           <MeshHeading>
              <DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
           </MeshHeading>
        </MeshHeadingList>
     </MedlineCitation>    
   </PubmedArticle>
</PubmedArticleSet>

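So far I have been able to extract the PublicationTypes without trouble using the following code (please run the top part of the code snippet at the end of this post first):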

utilAtype <- function(x){
        PMID <- xmlValue(x[[1]][[1]])
        PublicationType <- sapply(xmlChildren(x[["Article"]][["PublicationTypeList"]], omitNodeTypes = "XMLInternalTextNode"), xmlValue)
        data.frame(PMID = PMID, PublicationType=PublicationType, stringsAsFactors = FALSE)
}

PMIDAType <- xpathApply(hdisease, '//MedlineCitation', utilAtype)
PMIDAType <-do.call(rbind, PMIDAType)

      PMID PublicationType
  11841882    Case Reports
  11841882 Journal Article
  11841881 Journal Article

However, using a similar approach on the MeshHeadings skips the remaining child nodes, as shown below:

      PMID                         MHead
  11841882 Cardiopulmonary Resuscitation
  -- remaining entries for 11841882 are missing --
  11841881                          Aged

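I would be grateful if someone could enlighten me on this. The way it is presented in the example suggests that this approach should work. The code below is included for reference.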

require("XML")
xmlfile=xmlParse("file.xml", useInternalNodes = TRUE)
hdisease = xmlRoot(xmlfile)
utilMesh <- function(x){
        PMID <- xmlValue(x[[1]][[1]])
        MHead <- ifelse(is.null(x[["MeshHeadingList"]]), NA, 
                sapply(xmlChildren(x[["MeshHeadingList"]], omitNodeTypes = "XMLInternalTextNode"), function(z) xmlValue(z[["DescriptorName"]])))
        data.frame(PMID = PMID, MHead=MHead, stringsAsFactors = FALSE)
    }

PMIDMesh <- xpathApply(hdisease, '//MedlineCitation', utilMesh)
PMIDMesh<-do.call(rbind, PMIDMesh)
c<-nrow(PMIDMesh)
row.names(PMIDMesh) <- 1:c
nrow(table(PMIDMesh))

write.csv(PMIDMesh,"Mesh1.csv")

1 Answer:

Answer 0 (score: 0)

I would use XPath instead, perhaps...

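library(XML)      # xmlParse(), getNodeSet(), xpathSApply(), xmlValue()
library(rentrez)  # entrez_fetch()

# Fetch both records from PubMed as XML text
x <- entrez_fetch("pubmed", id = c(11841882, 11841881), rettype = "xml")
doc <- xmlParse(x)
pubs <- getNodeSet(doc, "//PubmedArticle")

# One data frame per article: the single PMID is recycled across its MeSH descriptors
y <- lapply(pubs, function(pub) data.frame(
     pmid = xpathSApply(pub, ".//MedlineCitation/PMID", xmlValue),
     mesh = xpathSApply(pub, ".//MeshHeading/DescriptorName", xmlValue)) )

do.call("rbind", y)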

       pmid                          mesh
1  11841882 Cardiopulmonary Resuscitation
2  11841882              Child, Preschool
3  11841882                        Female
4  11841882                  Heart Arrest
5  11841882                        Humans
6  11841882                        Infant
7  11841882                          Male
8  11841882         Retrospective Studies
9  11841882                  Time Factors
10 11841882        Vasoconstrictor Agents
11 11841882                  Vasopressins
12 11841881                          Aged
13 11841881 Cardiopulmonary Resuscitation
14 11841881         Electric Countershock
15 11841881               Family Practice
...
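
As for why the original function stops after the first heading: ifelse() is vectorised over its test, so a length-one test such as is.null(...) yields a length-one result, and only the first element of the sapply() output survives. The sketch below reproduces this offline on a trimmed copy of the sample data from the question and applies the same XPath idea per MedlineCitation. The names xml_text, utilMeshFixed and PMIDMeshFixed are made up for illustration, and filling in NA for citations without a MeshHeadingList is an assumption about the desired behaviour.

```r
library(XML)

# Trimmed copy of the question's sample data (two articles)
xml_text <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
      <PMID Version="1">11841882</PMID>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N" UI="D006323">Heart Arrest</DescriptorName>
          <QualifierName MajorTopicYN="Y" UI="Q000188">drug therapy</QualifierName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
      <PMID Version="1">11841881</PMID>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N" UI="D000368">Aged</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(xml_text, asText = TRUE)

# ifelse() returns a value shaped like its test: a length-1 test gives a
# length-1 result, so everything after the first element is dropped
truncated <- ifelse(FALSE, NA, c("a", "b", "c"))
print(truncated)

# Hypothetical replacement for utilMesh(): XPath relative to the citation node,
# with a plain if() instead of ifelse() for the "no MeSH headings" case
utilMeshFixed <- function(x) {
  pmid <- xpathSApply(x, "./PMID", xmlValue)
  mesh <- xpathSApply(x, "./MeshHeadingList/MeshHeading/DescriptorName", xmlValue)
  if (length(mesh) == 0) mesh <- NA_character_  # assumption: keep the PMID with NA
  data.frame(PMID = pmid, MHead = mesh, stringsAsFactors = FALSE)
}

PMIDMeshFixed <- do.call(rbind, xpathApply(doc, "//MedlineCitation", utilMeshFixed))
print(PMIDMeshFixed)
```

If you would rather keep the original sapply() call, swapping ifelse(cond, NA, ...) for if (cond) NA else ... should have the same effect, since a plain if/else returns the whole vector from whichever branch is taken.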