Question

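I want to add an attribute to the top-level node of an XML file and then save the file. I have tried every combination of XPath and subsetting I can think of, but I can't get it to work. Using a simple example: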

xml_string = c(
 '<?xml version="1.0" encoding="UTF-8"?>',
 '<retrieval-response status = "found">',
      '<coredata>',
           '<id type = "author" >12345</id>',
      '</coredata>',
      '<author>',
           '<first>John</first>',
           '<last>Doe</last>',
      '</author>',
 '</retrieval-response>')

# parse xml content (xmlParse comes from the XML package)
library(XML)
xml = xmlParse(xml_string)

When I try

xmlAttrs(xml["/retrieval-response"][[1]]) <- c(id = 12345)

I get the error:

object of type 'externalptr' is not subsettable

However, the attribute is inserted anyway, so I'm not sure what I'm doing wrong.

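(More background: this is a simplified version of data from the Scopus API. I am combining thousands of similarly structured XML files, but the id sits in the "coredata" node, which is a sibling of the "author" node that holds all the data. As a result, when I use SAS to compile the combined XML document into a dataset, there is no link between the id and the data. I am hoping that adding the id at the top of the hierarchy will make it propagate down to all the other levels.)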

Answer 1

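To migrate XML data into the two dimensions of rows and columns used by datasets and data frames, all nesting has to be removed so that only a parent and one child level are iterated. XSLT, a special-purpose declarative language for restructuring XML documents to whatever shape is needed, is a convenient way to reshape the XML for this end use.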

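Given your sample XML, below is an XSLT script that runs, and the resulting XML imports successfully into SAS. Have the SAS code loop over the files to restructure all of the thousands of XML files.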

XSLT (save as an .xsl or .xslt file)

 <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
       xmlns:ait="http://www.elsevier.com/xml/ani/ait"
       xmlns:ce="http://www.elsevier.com/xml/ani/common"
       xmlns:cto="http://www.elsevier.com/xml/cto/dtd"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:ns1="http://webservices.elsevier.com/schemas/search/fast/types/v4"
       xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/"
       xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
       xmlns:xoe="http://www.elsevier.com/xml/xoe/dtd"
       exclude-result-prefixes="ait ce cto dc ns1 prism xocs xoe">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />

 <xsl:template match="author-retrieval-response">
  <xsl:variable select="substring-after(coredata/dc:identifier, ':')" name="authorid"/>
  <root>
      <coredata>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="coredata/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="concat(.,@href)"/>
          </xsl:element>
        </xsl:for-each>
      </coredata>

      <subjectAreas>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="subject-areas/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </subjectAreas>

      <authorname>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/preferred-name/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </authorname>

      <classifications>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/classificationgroup/classifications/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </classifications>

      <journals>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/journal-history/journal/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </journals>

      <ipdoc>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/*[not(local-name()='address')]">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </ipdoc>

      <address>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/address/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </address>  
  </root>
 </xsl:template>

</xsl:transform>

SAS (using the script above)

proc xsl 
    in="C:\Path\To\Original.xml"
    out="C:\Path\To\Output.xml"
    xsl="C:\Path\To\XSLT.xsl";
run;

** STORING XML CONTENT;
libname temp xml 'C:\Path\To\Output.xml'; 

** APPEND CONTENT TO SAS DATASETS;
data Work.Coredata; 
    retain authorid;
    set temp.Coredata;  ** NAME OF PARENT NODE IN XML;
run;

data Work.SubjectAreas; 
    retain authorid;
    set temp.SubjectAreas;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Authorname;   
    retain authorid;
    set temp.Authorname;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Classifications;
    retain authorid;
    set temp.Classifications;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Journals; 
    retain authorid;
    set temp.Journals;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Ipdoc;    
    retain authorid;
    set temp.Ipdoc;  ** NAME OF PARENT NODE IN XML;
run;

XML OUTPUT (imported as the Authorsdata dataset with one row and 40 variables)

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <coredata>
      <authorid>1234567</authorid>
      <url>http://api.elsevier.com/content/author/author_id/1234567</url>
      <identifier>AUTHOR_ID:1234567</identifier>
      <eid>9-s2.0-1234567</eid>
      <document-count>3</document-count>
      <cited-by-count>95</cited-by-count>
      <citation-count>97</citation-count>
      <link>http://api.elsevier.com/content/search/scopus?query=refauid%1234567%29</link>
      <link>http://www.scopus.com/authid/detail.url?partnerID=HzOxMe3b&amp;authorId=1234567&amp;origin=inward</link>
      <link>http://api.elsevier.com/content/author/author_id/1234567</link>
      <link>http://api.elsevier.com/content/search/scopus?query=au-id%281234567%29</link>
   </coredata>
   <subjectAreas>
      <authorid>1234567</authorid>
      <subject-area>Human-Computer Interaction</subject-area>
      <subject-area>Control and Systems Engineering</subject-area>
      <subject-area>Software</subject-area>
      <subject-area>Computer Vision and Pattern Recognition</subject-area>
      <subject-area>Artificial Intelligence</subject-area>
   </subjectAreas>
   <authorname>
      <authorid>1234567</authorid>
      <initials>A.</initials>
      <indexed-name>John A.</indexed-name>
      <surname>John</surname>
      <given-name>Doe</given-name>
   </authorname>
   <classifications>
      <authorid>1234567</authorid>
      <classification>1709</classification>
      <classification>2207</classification>
      <classification>1712</classification>
      <classification>1707</classification>
      <classification>1702</classification>
   </classifications>
   <journals>
      <authorid>1234567</authorid>
      <sourcetitle>Very Prestigious Journal</sourcetitle>
      <sourcetitle-abbrev>V PRES JOU Autom</sourcetitle-abbrev>
      <issn>10504729</issn>
      <sourcetitle>2005 Another Prestigious Journal</sourcetitle>
      <sourcetitle-abbrev>An. Prest. Jou. </sourcetitle-abbrev>
   </journals>
   <ipdoc>
      <authorid>1234567</authorid>
      <afnameid>Prestigious University#1111111</afnameid>
      <afdispname>Prestigious University University</afdispname>
      <preferred-name>Prestigious University University</preferred-name>
      <sort-name>Prestigious University</sort-name>
      <org-domain>pu.edu</org-domain>
      <org-URL>http://www.pu.edu/index.shtml</org-URL>
   </ipdoc>
   <address>
      <authorid>1234567</authorid>
      <address-part>1234 Prestigious Lane</address-part>
      <city>City</city>
      <state>ST</state>
      <postal-code>12345</postal-code>
      <country>United States</country>
   </address>
</root>

R ALTERNATIVE

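Because there is no comprehensive XSLT library for R, the parsing has to be done directly in R. However, R can call the XSLT processors of other executables (i.e., Python, Saxon, VBA) through the command line, the RDCOMClient package, and other interfaces.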
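As a rough illustration of the command-line route, the sketch below wraps an external XSLT processor in an R function. It assumes the `xsltproc` executable (part of libxslt) is installed and on the PATH, which the answer does not mention; `run_xslt` is a hypothetical helper name.

```r
# Sketch: run an external XSLT processor from R via the command line.
# Assumption: the xsltproc executable (from libxslt) is available on the PATH.
run_xslt <- function(xsl, input, output, exe = "xsltproc") {
  if (!nzchar(Sys.which(exe))) {
    stop("XSLT processor not found on PATH: ", exe)
  }
  # xsltproc usage: xsltproc -o OUTPUT STYLESHEET INPUT
  status <- system2(exe, args = c("-o", shQuote(output), shQuote(xsl), shQuote(input)))
  if (status != 0) {
    stop("XSLT transformation failed with status ", status)
  }
  invisible(output)
}

# Example call, using the placeholder paths from the SAS code above:
# run_xslt("C:/Path/To/XSLT.xsl", "C:/Path/To/Original.xml", "C:/Path/To/Output.xml")
```

Checking for the executable first gives a clear error message instead of a cryptic shell failure when the processor is missing.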

Nonetheless, R can extract the XML data with xmlToDataFrame() and the authorid with xpathSApply() (the latter takes an XPath expression):

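library(XML)

# parse one source file (same placeholder path as in the SAS code above)
doc <- xmlParse("C:/Path/To/Original.xml")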

coredata <- xmlToDataFrame(nodes = getNodeSet(doc, '//coredata'))
coredata$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                          xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

subjectareas <- xmlToDataFrame(nodes = getNodeSet(doc, "//subject-areas"))
subjectareas$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                              xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

authorname <-  xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/preferred-name'))
authorname$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                            xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

classifications <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/classificationgroup/classifications'))
classifications$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                                 xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

journal <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/journal-history/journal'))
journal$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                         xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

ipdoc <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc'))
ipdoc$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                       xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

address <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc/address'))
address$authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                         xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])
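The same pattern can be written more compactly by extracting the author ID once and reusing it. The following is only a sketch: the inline XML is a made-up stand-in whose structure is assumed from the XSLT above, and `flatten_with_id` is a hypothetical helper name.

```r
library(XML)

# Made-up stand-in for a Scopus author record (structure assumed, not real data)
xml_text <- paste0(
  '<author-retrieval-response xmlns:dc="http://purl.org/dc/elements/1.1/">',
  '<coredata><dc:identifier>AUTHOR_ID:1234567</dc:identifier>',
  '<document-count>3</document-count></coredata>',
  '<author-profile><preferred-name>',
  '<surname>Doe</surname><given-name>John</given-name>',
  '</preferred-name></author-profile>',
  '</author-retrieval-response>'
)
doc <- xmlParse(xml_text, asText = TRUE)

# Extract the author ID once instead of repeating the gsub() call per table
authorid <- sub("AUTHOR_ID:", "",
                xpathSApply(doc, "//coredata/dc:identifier", xmlValue)[[1]],
                fixed = TRUE)

# Hypothetical helper: flatten the nodes matched by `path` and tag them with the ID
flatten_with_id <- function(doc, path, id) {
  df <- xmlToDataFrame(nodes = getNodeSet(doc, path))
  df$authorid <- id
  df
}

coredata   <- flatten_with_id(doc, "//coredata", authorid)
authorname <- flatten_with_id(doc, "//author-profile/preferred-name", authorid)
```

Each resulting data frame carries its own authorid column, so the tables can be joined later on that key.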

Answer 2

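Edit: After trying the approach of editing the top-level node (see the old answer below), I realized that editing the top-level node does not solve my problem, because the SAS XML Mapper did not retain all of the IDs.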

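I tried a new approach, adding the author ID to each child node, which worked perfectly. I also learned that you can select multiple nodes with XPath by putting the expressions in a vector, like this: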

c("//coredata", "//affiliation-current", "affiliation-history", "subject-areas", "//author-profile")
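One explicit way to select several node sets in a single query is the XPath union operator `|`. The sketch below builds such an expression from a vector of paths and then tags every matched node; the sample document and the ID value are made up for illustration.

```r
library(XML)

# Made-up sample document; only the node names echo the vector above
doc <- xmlParse(
  '<record><coredata/><author-profile><subject-areas/></author-profile><other/></record>',
  asText = TRUE
)

paths <- c("//coredata", "//subject-areas", "//author-profile")

# Join the expressions with the XPath union operator "|"
union_path <- paste(paths, collapse = " | ")
nodes <- getNodeSet(doc, union_path)

# Names of the matched nodes (the <other> element is not selected)
matched <- sapply(nodes, xmlName)

# Add the same attribute to every matched node; nodes are modified in place
invisible(lapply(nodes, addAttributes, auth_id = "1234567"))
```

Because the nodes are references into the parsed document, the attributes appear in `doc` without any reassignment.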

So the final program I used is:

library(XML)

files <- list.files()
for (i in seq_along(files)) {
  author_record <- xmlParse(files[i])
  # author ID without the "AUTHOR_ID:" prefix
  auth_id <- gsub("AUTHOR_ID:", "", xmlValue(author_record[["//dc:identifier"]]))
  # add the ID as an attribute to every selected node
  xpathApply(
    author_record,
    c("//coredata", "//affiliation-current", "affiliation-history",
      "subject-areas", "//author-profile"),
    addAttributes,
    auth_id = auth_id
  )
  saveXML(author_record, file = files[i])
}

Old answer: After a lot of experimenting, I found a fairly simple solution.

An attribute can be added to the top-level node simply with
addAttributes(xmlRoot(xmlfile), attribute = "attributeValue")
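Applied to a shortened version of the sample from the question, this might look like the sketch below. One likely explanation for the original error is that the nested replacement form makes R try to assign the result back into `xml` with `[<-`, which an external pointer does not support, after the node has already been changed in place. The output file name is a placeholder.

```r
library(XML)

xml_string <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<retrieval-response status = "found">',
  '<coredata><id type = "author" >12345</id></coredata>',
  '</retrieval-response>'
)
xml <- xmlParse(paste(xml_string, collapse = "\n"), asText = TRUE)

# Internal nodes are references into the document, so changing `root`
# changes `xml` itself; nothing has to be assigned back into `xml`
root <- xmlRoot(xml)
addAttributes(root, id = "12345")

# Inspect the attributes now present on the top-level node
xmlAttrs(xmlRoot(xml))

# Save the modified document (placeholder file name in a temp directory)
out_file <- file.path(tempdir(), "modified.xml")
saveXML(xml, file = out_file)
```

Working on the node returned by `xmlRoot()` avoids the subsetting expression on the left-hand side of the assignment altogether.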

For my specific case, the most straightforward solution was a simple loop:

library(XML)

setwd("C:/directory/with/individual/xmlfiles")
files <- list.files()
for (i in seq_along(files)) {
  author_record <- xmlParse(files[i])
  # copy the author ID (minus the "AUTHOR_ID:" prefix) onto the root node
  addAttributes(
    node = xmlRoot(author_record),
    id = gsub(pattern = "AUTHOR_ID:", replacement = "",
              x = xmlValue(author_record[["//dc:identifier"]]))
  )
  saveXML(author_record, file = files[i])
}

I'm sure there are better ways. Clearly I need to learn XSLT, which is a very powerful approach!

How do I modify the top-level XML node in R?
