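Importing XML data with missing values into R

Time: 2016-01-05 13:13:10

Tags: xml r xpath

I am currently struggling to import data from an XML file into R.

The XML file contains multiple records, each of which should end up on a single row of the data frame. Example record: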

<rec resultID="5">
  <header shortDbName="psyh" longDbName="PsycINFO" uiTerm="2015-99210-426">
    <controlInfo>
      <bkinfo>
        <btl>The impact of zoo live animal presentations on students' propensity to engage in conservation behaviors.</btl>
        <aug />
        <isbn>9781321491562</isbn>
      </bkinfo>
      <chapinfo />
      <revinfo />
      <dissinfo>
        <disstl>The impact of zoo live animal presentations on students' propensity to engage in conservation behaviors.</disstl>
      </dissinfo>
      <jinfo>
        <jtl>Dissertation Abstracts International Section A: Humanities and Social Sciences</jtl>
        <issn type="Print">04194209</issn>
      </jinfo>
      <pubinfo>
        <dt year="2015" month="01" day="01">20150101</dt>
        <vid>76</vid>
        <iid>5-A(E)</iid>
      </pubinfo>
      <artinfo>
        <ui type="umi">AAI3671924</ui>
        <tig>
          <atl>The impact of zoo live animal presentations on students' propensity to engage in conservation behaviors.</atl>
        </tig>
        <aug>
          <au>Kirchgessner, Mandy L.</au>
        </aug>
        <sug>
          <subj type="major">Animals</subj>
          <subj type="major">Hope</subj>
          <subj type="minor">Conservation (Ecological Behavior)</subj>
          <subj type="minor">Outreach Programs</subj>
          <subj type="minor">Psychological Development</subj>
        </sug>
        <ab>Zoos frequently deploy outreach programs, often called "Zoomobiles," to schools; these programs incorporate zoo resources, such as natural artifacts and live animals, in order to teach standardized content and in hopes of inspiring students to protect the environment. Educational research at zoos is relatively rare, and research on their outreach programs is non-existent. This leaves zoos vulnerable to criticisms as they have little to no evidence that their strategies support their missions, which target conservation outcomes. This study seeks to shed light on this gap by analyzing the impact that live animals have on offsite program participants' interests in animals and subsequent conservation outcomes. The theoretical lens is derived from the field of Conservation Psychology, which believes personal connections with nature serve as the motivational component to engagement with conservation efforts. Using pre, post, and delayed surveys combined with Zoomobile presentation observations, I analyzed the roles of sensory experiences in students' (N=197) development of animal interest and conservation behaviors. Results suggest that touching even one animal during presentations has a significant impact on conservation intents and sustainment of those intents. Although results on interest outcomes are conflicting, this study points to ways this kind of research can make significant contributions to zoo learning outcomes. Other significant variables, such as emotional predispositions and animal-related excitement, are discussed in light of future research directions. (PsycINFO Database Record (c) 2015 APA, all rights reserved)</ab>
        <pubtype>Dissertation Abstract</pubtype>
        <doctype>Dissertation</doctype>
      </artinfo>
      <language>English</language>
    </controlInfo>
    <displayInfo>
      <pLink>
        <url>http://search.ebscohost.com/login.aspx?direct=true&amp;db=psyh&amp;AN=2015-99210-426&amp;site=ehost-live&amp;scope=site</url>
      </pLink>
    </displayInfo>
  </header>
</rec>

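I tried the following, but it becomes slow for larger data sets. In addition, when data is missing from a node, I would like the function to return "NA" for the given row/record, but I don't think that can be done with this function?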

title <- xmlToDataFrame(nodes = getNodeSet(xmltop, "//atl"), stringsAsFactors = FALSE)
author <- xmlToDataFrame(nodes = getNodeSet(xmltop, "//artinfo/aug/au[1]"), stringsAsFactors = FALSE)
abstract <- xmlToDataFrame(nodes = getNodeSet(xmltop, "//artinfo/ab[1]"), stringsAsFactors = FALSE)
year <- xmlToDataFrame(nodes = getNodeSet(xmltop, "//pubinfo/dt"), stringsAsFactors = FALSE)

I tried to follow the instructions in "R dataframe from XML when values are multiple or missing", but without success:

doc = xmlParse(file.choose(), useInternalNodes = TRUE)

do.call(rbind, xpathApply(xmltop, "/rec", function(node) {
  auth <- xmlValue(node[["artinfo/aug/au[1]"]])
    if (is.null(auth)) auth <- NA
  year <- xmlValue(node[["//pubinfo/dt"]])
    if (is.null(year)) year <- NA
  title <- xmlValue(node[["//atl"]])
    if (is.null(title)) title <- NA
  abstract <- xmlValue(node[["//artinfo/ab[1]"]])
    if (is.null(abstract)) abstract <- NA

  data.frame(auth, year, title, abstract, stringsAsFactors = FALSE)

}))

I am still not very familiar with XPath and R, but I guess there is some problem with the "node" bit above?
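A note on the attempt above: in the XML package, `[[` on a node looks up a child element by name or position and does not evaluate an XPath expression. In XPath itself, a path starting with `//` searches the whole document rather than the current record, and `/rec` only matches a root element named `rec`. The following is a minimal sketch of a per-record extraction that returns NA for missing values, using relative XPath. The wrapper element `records`, the stand-in data, and the helper `first_value` are hypothetical and only serve the illustration.

```r
library(XML)

# Stand-in document: the second record has no <au> and no <ab>
xml_text <- '<records>
  <rec resultID="1"><header><controlInfo>
    <pubinfo><dt year="2015" month="01" day="01">20150101</dt></pubinfo>
    <artinfo>
      <tig><atl>First title</atl></tig>
      <aug><au>Doe, Jane</au></aug>
      <ab>First abstract</ab>
    </artinfo>
  </controlInfo></header></rec>
  <rec resultID="2"><header><controlInfo>
    <pubinfo><dt year="2014" month="01" day="01">20140101</dt></pubinfo>
    <artinfo>
      <tig><atl>Second title</atl></tig>
      <aug />
    </artinfo>
  </controlInfo></header></rec>
</records>'

doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: value of the first node matching a path relative
# to `node`, or NA when nothing matches or the match is empty
first_value <- function(node, path) {
  hits <- getNodeSet(node, path)
  if (length(hits) == 0) return(NA_character_)
  val <- xmlValue(hits[[1]])
  if (length(val) == 0 || is.na(val) || !nzchar(val)) NA_character_ else val
}

# The leading "." keeps each search inside the current <rec>
records <- lapply(getNodeSet(doc, "//rec"), function(node) {
  data.frame(
    auth     = first_value(node, ".//artinfo/aug/au[1]"),
    year     = first_value(node, ".//pubinfo/dt"),
    title    = first_value(node, ".//artinfo/tig/atl"),
    abstract = first_value(node, ".//artinfo/ab[1]"),
    stringsAsFactors = FALSE
  )
})
result <- do.call(rbind, records)
```

Because every record contributes exactly one row, columns stay aligned even when a field is absent, which is not the case when each column is collected separately with `//` paths over the whole document.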

1 answer:

Answer 0 (score: 1)

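As mentioned, consider running an XSLT to simplify your XML to a single child level of rows and columns, which can then easily be imported into R with xmlToDataFrame():

<row>
  <column>data</column>
  <column>data</column>
  <column>data</column>
</row>
<row>
  <column>data</column>
  <column>data</column>
  <column>data</column>
</row>

R does not yet have a universal XSLT 1.0 processor. Fortunately, most general-purpose languages (including C#, Java, Python, PHP, Perl, VB) can run XSLT scripts to reformat/restructure complex XML data. Below are Python and VBA scripts, with the final R import line.

XSLT script (save as an .xsl or .xslt file)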

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <!-- Walks the tree without copying nodes (not a true identity transform) -->
  <xsl:template match="@*|node()">
      <xsl:apply-templates select="@*|node()"/>
  </xsl:template>

  <!-- Removes Element/Keeps Children Data -->
  <xsl:template match="header">
      <xsl:apply-templates />
  </xsl:template>

  <!-- Replaces Element/Keeps Children Data -->
  <xsl:template match="rec">
    <data>
      <xsl:apply-templates />
    </data>
  </xsl:template>

  <!-- Extracts Needed Elements -->
  <xsl:template match="controlInfo">
    <row>
      <title><xsl:value-of select="artinfo/tig/atl"/></title>
      <author><xsl:value-of select="artinfo/aug/au"/></author>
      <abstract><xsl:value-of select="artinfo/ab"/></abstract>
      <year><xsl:value-of select="pubinfo/dt"/></year>
    </row>
  </xsl:template>

 <!-- Removes Element (empty template) --> 
 <xsl:template match="displayInfo"/> 

</xsl:transform>

Python script (using the lxml module)

import lxml.etree as ET

# LOAD XML AND XSL FILES
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')

# TRANSFORMS INPUT
transform = ET.XSLT(xslt)
newdom = transform(dom)

# OUTPUTS FILE
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True,  xml_declaration=True)
print(tree_out.decode("utf-8"))

with open('Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)

VBA macro (using the MSXML object)

Sub TransformXML()
    Dim xmlDoc As Object, xslDoc As Object, newDoc As Object

    ' INITIALIZE MSXML OBJECTS '
    Set xmlDoc = CreateObject("MSXML2.DOMDocument")
    Set xslDoc = CreateObject("MSXML2.DOMDocument")
    Set newDoc = CreateObject("MSXML2.DOMDocument")

    ' LOAD XML AND XSL '
    xmlDoc.async = False
    xmlDoc.Load "C:\Path\To\Input.xml"

    xslDoc.async = False
    xslDoc.Load "C:\Path\To\XSLTScript.xsl"

    ' TRANSFORM XML '
    xmlDoc.transformNodeToObject xslDoc, newDoc

    ' OUTPUT XML '
    newDoc.Save "C:\Path\To\Output.xml"

    Set xmlDoc = Nothing
    Set xslDoc = Nothing
    Set newDoc = Nothing

End Sub

Transformed XML output

<?xml version='1.0' encoding='UTF-8'?>
    <data>
      <row>
        <title>The impact of zoo live animal presentations on students' 
               propensity to engage in conservation behaviors.</title>
        <author>Kirchgessner, Mandy L.</author>
        <abstract>Zoos frequently deploy outreach programs, often called 
                  "Zoomobiles," to schools; these programs incorporate zoo resources, such as 
                  natural artifacts and live animals, in order to teach standardized content 
                  and in hopes of inspiring students to protect the environment. Educational 
                  research at zoos is relatively rare, and research on their outreach programs 
                  is non-existent. This leaves zoos vulnerable to criticisms as they have 
                  little to no evidence that their strategies support their missions, which 
                  target conservation outcomes. This study seeks to shed light on this gap by 
                  analyzing the impact that live animals have on offsite program participants' 
                  interests in animals and subsequent conservation outcomes. The theoretical 
                  lens is derived from the field of Conservation Psychology, which believes 
                  personal connections with nature serve as the motivational component to 
                  engagement with conservation efforts. Using pre, post, and delayed surveys 
                  combined with Zoomobile presentation observations, I analyzed the roles of 
                  sensory experiences in students' (N=197) development of animal interest and 
                  conservation behaviors. Results suggest that touching even one animal during 
                  presentations has a significant impact on conservation intents and 
                  sustainment of those intents. Although results on interest outcomes are 
                  conflicting, this study points to ways this kind of research can make 
                  significant contributions to zoo learning outcomes. Other significant 
                  variables, such as emotional predispositions and animal-related excitement, 
                  are discussed in light of future research directions. (PsycINFO Database 
                  Record (c) 2015 APA, all rights reserved)</abstract>
        <year>20150101</year>
      </row>
   </data>
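
For the final R import step, a minimal sketch using the XML package might look like the following. It assumes the transformed file has the `<data>`/`<row>` layout shown above. The inline `flat_xml` string is a hypothetical stand-in so the sketch runs on its own; in practice it would be replaced by `xmlParse("Output.xml")`.

```r
library(XML)

# Hypothetical stand-in for Output.xml; replace with xmlParse("Output.xml")
flat_xml <- '<data>
  <row>
    <title>First title</title>
    <author>Doe, Jane</author>
    <abstract>First abstract</abstract>
    <year>20150101</year>
  </row>
  <row>
    <title>Second title</title>
    <author>Roe, Richard</author>
    <abstract>Second abstract</abstract>
    <year>20140101</year>
  </row>
</data>'

doc <- xmlParse(flat_xml, asText = TRUE)

# One data frame row per <row>, one column per child element
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//row"),
                     stringsAsFactors = FALSE)

# Assumption: fields missing in the source may arrive as empty strings,
# depending on the package version; recode any such values as NA
df[] <- lapply(df, function(col) {
  col[!is.na(col) & col == ""] <- NA
  col
})
```

Since the XSLT emits the same four child elements for every record, each `<row>` maps onto one data frame row with the columns in a fixed order.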