How do I modify the top-level XML node in R?

Asked: 2015-11-10 15:02:05

Tags: xml r xslt sas scopus

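I want to add an attribute to the top-level node of an XML file and then save the file. I have tried every combination of XPath and subsetting I can think of, but I can't get it to work. Using a simple example: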

library(XML)

xml_string = c(
 '<?xml version="1.0" encoding="UTF-8"?>',
 '<retrieval-response status = "found">',
      '<coredata>',
           '<id type = "author" >12345</id>',
      '</coredata>',
      '<author>',
           '<first>John</first>',
           '<last>Doe</last>',
      '</author>',
 '</retrieval-response>')

# parse xml content
xml = xmlParse(xml_string)

When I try

xmlAttrs(xml["/retrieval-response"][[1]]) <- c(id = 12345)

I get the error:

object of type 'externalptr' is not subsettable

However, the attribute is inserted anyway, so I'm not sure what I'm doing wrong.

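(More background: this is a simplified version of data from the Scopus API. I am combining thousands of similarly structured XML files, but the id sits in the "coredata" node, which is a sibling of the "author" node that contains all the data. So when I use SAS to compile the combined XML document into a dataset, there is no link between the id and the data. I am hoping that adding the id to the top of the hierarchy will cause it to propagate down to all the other levels.)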

2 Answers:

Answer 0 (score: 2)

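To migrate XML data into the two-dimensional rows and columns of a dataset or data frame, all nesting must be removed so that only a parent and one level of children are iterated over. XSLT, a special-purpose declarative language for restructuring XML documents, is well suited to reshaping XML data for this kind of end use.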

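Given your sample XML, below is an XSLT that you can run; the resulting XML imports successfully into SAS. Have the SAS code loop to restructure all of the thousands of XML files.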

XSLT (save as .xsl or .xslt)

 <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
       xmlns:ait="http://www.elsevier.com/xml/ani/ait"
       xmlns:ce="http://www.elsevier.com/xml/ani/common"
       xmlns:cto="http://www.elsevier.com/xml/cto/dtd"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:ns1="http://webservices.elsevier.com/schemas/search/fast/types/v4"
       xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/"
       xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
       xmlns:xoe="http://www.elsevier.com/xml/xoe/dtd"
       exclude-result-prefixes="ait ce cto dc ns1 prism xocs xoe">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />

 <xsl:template match="author-retrieval-response">
  <xsl:variable select="substring-after(coredata/dc:identifier, ':')" name="authorid"/>
  <root>
      <coredata>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="coredata/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="concat(.,@href)"/>
          </xsl:element>
        </xsl:for-each>
      </coredata>

      <subjectAreas>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="subject-areas/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </subjectAreas>

      <authorname>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/preferred-name/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </authorname>

      <classifications>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/classificationgroup/classifications/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </classifications>

      <journals>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/journal-history/journal/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </journals>

      <ipdoc>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/*[not(local-name()='address')]">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </ipdoc>

      <address>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/address/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </address>  
  </root>
 </xsl:template>

</xsl:transform>

SAS (using the script above)

proc xsl 
    in="C:\Path\To\Original.xml"
    out="C:\Path\To\Output.xml"
    xsl="C:\Path\To\XSLT.xsl";
run;

** STORING XML CONTENT;
libname temp xml 'C:\Path\To\Output.xml'; 

** APPEND CONTENT TO SAS DATASETS;
data Work.Coredata; 
    retain authorid;
    set temp.Coredata;  ** NAME OF PARENT NODE IN XML;
run;

data Work.SubjectAreas; 
    retain authorid;
    set temp.SubjectAreas;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Authorname;   
    retain authorid;
    set temp.Authorname;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Classifications;
    retain authorid;
    set temp.Classifications;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Journals; 
    retain authorid;
    set temp.Journals;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Ipdoc;    
    retain authorid;
    set temp.Ipdoc;  ** NAME OF PARENT NODE IN XML;
run;

XML OUTPUT (imports as the Authorsdata dataset with one row and 40 variables)

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <coredata>
      <authorid>1234567</authorid>
      <url>http://api.elsevier.com/content/author/author_id/1234567</url>
      <identifier>AUTHOR_ID:1234567</identifier>
      <eid>9-s2.0-1234567</eid>
      <document-count>3</document-count>
      <cited-by-count>95</cited-by-count>
      <citation-count>97</citation-count>
      <link>http://api.elsevier.com/content/search/scopus?query=refauid%1234567%29</link>
      <link>http://www.scopus.com/authid/detail.url?partnerID=HzOxMe3b&amp;authorId=1234567&amp;origin=inward</link>
      <link>http://api.elsevier.com/content/author/author_id/1234567</link>
      <link>http://api.elsevier.com/content/search/scopus?query=au-id%281234567%29</link>
   </coredata>
   <subjectAreas>
      <authorid>1234567</authorid>
      <subject-area>Human-Computer Interaction</subject-area>
      <subject-area>Control and Systems Engineering</subject-area>
      <subject-area>Software</subject-area>
      <subject-area>Computer Vision and Pattern Recognition</subject-area>
      <subject-area>Artificial Intelligence</subject-area>
   </subjectAreas>
   <authorname>
      <authorid>1234567</authorid>
      <initials>A.</initials>
      <indexed-name>John A.</indexed-name>
      <surname>John</surname>
      <given-name>Doe</given-name>
   </authorname>
   <classifications>
      <authorid>1234567</authorid>
      <classification>1709</classification>
      <classification>2207</classification>
      <classification>1712</classification>
      <classification>1707</classification>
      <classification>1702</classification>
   </classifications>
   <journals>
      <authorid>1234567</authorid>
      <sourcetitle>Very Prestigious Journal</sourcetitle>
      <sourcetitle-abbrev>V PRES JOU Autom</sourcetitle-abbrev>
      <issn>10504729</issn>
      <sourcetitle>2005 Another Prestigious Journal</sourcetitle>
      <sourcetitle-abbrev>An. Prest. Jou. </sourcetitle-abbrev>
   </journals>
   <ipdoc>
      <authorid>1234567</authorid>
      <afnameid>Prestigious University#1111111</afnameid>
      <afdispname>Prestigious University University</afdispname>
      <preferred-name>Prestigious University University</preferred-name>
      <sort-name>Prestigious University</sort-name>
      <org-domain>pu.edu</org-domain>
      <org-URL>http://www.pu.edu/index.shtml</org-URL>
   </ipdoc>
   <address>
      <authorid>1234567</authorid>
      <address-part>1234 Prestigious Lane</address-part>
      <city>City</city>
      <state>ST</state>
      <postal-code>12345</postal-code>
      <country>United States</country>
   </address>
</root>

R ALTERNATIVE

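Because no comprehensive XSLT library exists for R, the parsing has to be done directly in the R language. However, R can call the XSLT processors of other executables (e.g. Python, Saxon, VBA) through the command line, the RDCOMClient package, and other interfaces.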
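As a minimal sketch of the command-line route: assuming the `xsltproc` tool from libxslt is installed and on the PATH (an assumption; the answer itself names other processors), R can invoke it with `system2()`. The helper names `xsltproc_args` and `run_xslt` are hypothetical and introduced only for illustration.

```r
# Build the argument vector for xsltproc: -o <output> <stylesheet> <input>
xsltproc_args <- function(xsl, input, output) {
  c("-o", output, xsl, input)
}

# Run the transformation; stops if xsltproc is missing or exits with an error.
run_xslt <- function(xsl, input, output) {
  exe <- Sys.which("xsltproc")
  if (!nzchar(exe)) stop("xsltproc was not found on the PATH")
  # shQuote() protects paths that contain spaces
  status <- system2(exe, shQuote(xsltproc_args(xsl, input, output)))
  if (status != 0) stop("xsltproc exited with status ", status)
  invisible(output)
}

# Example call, using the placeholder paths from the SAS step above:
# run_xslt("C:/Path/To/XSLT.xsl", "C:/Path/To/Original.xml", "C:/Path/To/Output.xml")
```

The transformed file could then be read back with `xmlParse()` or handed to SAS as before.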

Nonetheless, R can extract the XML data with xmlToDataFrame(), and the authorid with xpathSApply() (the latter evaluates XPath expressions):

library(XML)

# Parse the original (untransformed) file; the path mirrors the SAS step above
doc <- xmlParse("C:/Path/To/Original.xml")

# Author ID without its "AUTHOR_ID:" prefix, extracted once and reused below
authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                 xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

# One data frame per parent node, each tagged with the author ID
coredata <- xmlToDataFrame(nodes = getNodeSet(doc, '//coredata'))
coredata$authorid <- authorid

subjectareas <- xmlToDataFrame(nodes = getNodeSet(doc, "//subject-areas"))
subjectareas$authorid <- authorid

authorname <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/preferred-name'))
authorname$authorid <- authorid

classifications <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/classificationgroup/classifications'))
classifications$authorid <- authorid

journal <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/journal-history/journal'))
journal$authorid <- authorid

ipdoc <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc'))
ipdoc$authorid <- authorid

address <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc/address'))
address$authorid <- authorid
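Since the question involves thousands of files, the same extraction can be wrapped in a function and applied per record. The following is only a sketch under stated assumptions: the two inline records are tiny stand-ins for real Scopus responses, and `make_record` and `read_table` are hypothetical helper names, not part of the XML package.

```r
library(XML)

# Build a tiny stand-in record (real Scopus responses are much larger).
make_record <- function(id, surname) {
  sprintf(paste0(
    '<author-retrieval-response xmlns:dc="http://purl.org/dc/elements/1.1/">',
    '<coredata><dc:identifier>AUTHOR_ID:%s</dc:identifier></coredata>',
    '<author-profile><preferred-name>',
    '<surname>%s</surname><initials>J.</initials>',
    '</preferred-name></author-profile>',
    '</author-retrieval-response>'), id, surname)
}
records <- c(make_record("111", "Doe"), make_record("222", "Roe"))

# Flatten the nodes matched by `xpath` into a data frame and attach the author ID.
read_table <- function(xml_text, xpath) {
  doc <- xmlParse(xml_text, asText = TRUE)
  authorid <- sub("AUTHOR_ID:", "",
                  xpathSApply(doc, "//coredata/dc:identifier", xmlValue)[[1]])
  df <- xmlToDataFrame(nodes = getNodeSet(doc, xpath), stringsAsFactors = FALSE)
  df$authorid <- authorid
  df
}

# Stack the per-record tables into one data frame keyed by authorid.
authorname <- do.call(rbind, lapply(records, read_table,
                                    xpath = "//author-profile/preferred-name"))
print(authorname)
```

For files on disk, `xmlParse(path)` would replace the `asText = TRUE` call, and `lapply()` would run over `list.files()`.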

Answer 1 (score: 1)

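Edit: After trying the approach of editing the top-level node (see the old answer below), I realized that editing the top-level node does not solve my problem, because the SAS XML Mapper did not retain all of the IDs.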

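I tried a new approach, adding the author ID to each of the child nodes, which worked perfectly. I also learned that you can use XPath to select multiple nodes by putting the paths in a vector, like this:

c("//coredata",
  "//affiliation-current",
  "//affiliation-history",
  "//subject-areas",
  "//author-profile")

So the final program I used is:

library(XML)

files <- list.files()

for (i in seq_along(files)) {
     author_record <- xmlParse(files[i])

     # Author ID without its "AUTHOR_ID:" prefix
     auth_id <- gsub("AUTHOR_ID:", "", xmlValue(author_record[["//dc:identifier"]]))

     # Add the ID as an attribute to every selected node;
     # each path starts with // so that it matches at any depth
     xpathApply(
          author_record, c(
               "//coredata",
               "//affiliation-current",
               "//affiliation-history",
               "//subject-areas",
               "//author-profile"
          ),
          addAttributes,
          auth_id = auth_id
     )

     # Overwrite the original file with the modified document
     saveXML(author_record, file = files[i])
}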
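Selecting several node types at once can also be written with the XPath union operator `|`, which is standard XPath 1.0. The sketch below joins the paths explicitly rather than relying on how `xpathApply()` treats a vector of paths; the inline document and the ID value are made up for illustration.

```r
library(XML)

# Small stand-in document with three sections to tag
doc <- xmlParse(paste(
  '<author-retrieval-response>',
  '  <coredata><eid>9-s2.0-1234567</eid></coredata>',
  '  <subject-areas><subject-area>Software</subject-area></subject-areas>',
  '  <author-profile><preferred-name><surname>Doe</surname></preferred-name></author-profile>',
  '</author-retrieval-response>',
  sep = "\n"
), asText = TRUE)

paths <- c("//coredata", "//subject-areas", "//author-profile")
union_path <- paste(paths, collapse = " | ")  # "//coredata | //subject-areas | ..."

# Tag every matched node with the same (hypothetical) author ID.
invisible(xpathApply(doc, union_path, addAttributes, auth_id = "1234567"))

# List the names of all elements that now carry the attribute.
tagged <- xpathSApply(doc, "//*[@auth_id]", xmlName)
print(tagged)
```

Because internal nodes are modified by reference, the document itself changes and no result has to be assigned back.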

Old answer: After much experimentation, I found a fairly simple solution.

You can add an attribute to the top-level node simply by using

addAttributes(xmlRoot(xmlfile), attribute = "attributeValue")
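A minimal self-contained sketch of this on the sample document from the question (the variable names and the temporary output file are illustrative). It also suggests why the original attempt errored even though the attribute was added: `xmlAttrs(xml["/retrieval-response"][[1]]) <- ...` is a nested replacement call, so the attribute is most likely set on the underlying node by reference first, and R then tries to assign the result back into `xml` with `[<-`, which an external pointer does not support. Fetching the node first avoids that final step.

```r
library(XML)

xml_string <- paste(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<retrieval-response status="found">',
  '  <coredata><id type="author">12345</id></coredata>',
  '  <author><first>John</first><last>Doe</last></author>',
  '</retrieval-response>',
  sep = "\n"
)
doc <- xmlParse(xml_string, asText = TRUE)

# Nodes of an internal document are references, so changing the root
# node changes the document itself; nothing is assigned back into `doc`.
root <- xmlRoot(doc)
invisible(addAttributes(root, id = "12345"))

# Inspect the attributes and write the document out
# (tempfile() is used here only for illustration).
print(xmlAttrs(root))
out_file <- tempfile(fileext = ".xml")
saveXML(doc, file = out_file)
```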

For my specific case, the most straightforward solution was a simple loop:

library(XML)

setwd("C:/directory/with/individual/xmlfiles")

files <- list.files()

for (i in seq_along(files)) {

 author_record <- xmlParse(files[i])

 # Add the author ID (without its "AUTHOR_ID:" prefix) to the root node
 addAttributes(node = xmlRoot(author_record), 
               id   = gsub   (pattern = "AUTHOR_ID:", 
                              replacement = "", 
                              x = xmlValue(author_record[["//dc:identifier"]])
               )
 )

  # Overwrite the original file with the modified document
  saveXML(author_record, file = files[i])
}

I'm sure there is a better way. Clearly I need to learn XSLT, which is a very powerful approach!