How to handle zero-length character vectors from the R XML package

Date: 2014-05-22 21:12:16

Tags: xml r xml-parsing dataframe

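I am parsing an XML file from a web service and then converting it to a data.frame. My sample code is listed below; to be honest, it is copied directly from the following post.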

http://www.r-bloggers.com/r-and-the-web-for-beginners-part-ii-xml-in-r/

I admit I am fairly new to working with XML files, but I need to parse this one into a data frame.

    library(RCurl)
    library(XML)
    xml.url <-('webservice url that links  to an XML document')
    xml.file <- xmlTreeParse(xml.url)
    xmltop <- xmlRoot(xml.file)

    Data <- xmlSApply(xmltop,function(x) xmlSApply(x,xmlValue))
    Data <- data.frame(t(Data),row.names=NULL)

Below is a sample of the data I am working with. I limited it to a few columns because there are more than 300 of them.

Data <- structure(list(start = structure(list(row = "05/11/2014 06:59:48 UTC", 
    row = "05/11/2014 06:45:59 UTC", row = "05/11/2014 06:26:16 UTC", 
    row = "05/11/2014 06:52:42 UTC"), .Names = c("row", "row", 
    "row", "row")), end = structure(list(row = "05/11/2014 14:16:23 UTC", 
    row = "05/11/2014 13:52:10 UTC", row = "05/11/2014 13:38:41 UTC", 
    row = "05/11/2014 14:34:42 UTC"), .Names = c("row", "row", 
    "row", "row")), today = structure(list(row = "05/11/2014", row = "05/11/2014", 
    row = "05/11/2014", row = "05/11/2014"), .Names = c("row", 
    "row", "row", "row")), Record_Name = structure(list(row = character(0), 
    row = character(0), row = character(0), row = character(0)), .Names = c("row", 
    "row", "row", "row")), Watersource_GPS_Cords = structure(list(
    row = "22.503822:88.347462:0.0:26.0", row = "22.505717:88.348593:20.044726:16.0", 
    row = "22.503821:88.34746:0.0:27.0", row = "22.505585:88.347121:-43.040066:12.0"), .Names = c("row", 
    "row", "row", "row")), Description_of_location = structure(list(
    row = character(0), row = "By swisspark nursing home", row = character(0), 
    row = character(0)), .Names = c("row", "row", "row", "row"
    )), Free_chlorine_input = structure(list(row = "2.5", row = "1.36", 
    row = "1.1", row = character(0)), .Names = c("row", "row", 
    "row", "row"))), .Names = c("start", "end", "today", "Record_Name", 
    "Watersource_GPS_Cords", "Description_of_location", "Free_chlorine_input"
    ), class = "data.frame", row.names = c(NA, -4L))

Here is my sessionInfo():

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_0.6.2  shiny_0.9.1    XML_3.98-1.1   RCurl_1.95-4.1 bitops_1.0-6  

loaded via a namespace (and not attached):
[1] caTools_1.17    digest_0.6.4    httpuv_1.3.0    plyr_1.8.1      Rcpp_0.11.1    
[6] RJSONIO_1.2-0.2 tools_3.0.1     xtable_1.7-3  

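Everything works fine except for the zero-length character vectors, which end up as lists inside the data frame. I assume that because the parser found empty nodes, it had to store the values as lists and then wrap those lists in the data frame; otherwise the vectors would have different lengths and cause an error. I am really confused about how to handle this gracefully. I would prefer to set an option that converts these values to NA or just "", so that I get a data frame of vectors rather than lists, and if possible even have each column formatted as the appropriate type. This matters mainly because I need to write logical tests across columns.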

I am used to working with structures like this.

> Data[,"Description_of_location"]
[1] ""
[2] "By swisspark nursing home"
[3] ""
[4] ""

Instead, I get this.

> Data[,"Description_of_location"]

$row
character(0)

$row
[1] "By swisspark nursing home"

$row
character(0)

$row
character(0)
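
A list column like this can also be flattened after parsing, using base R only. The following is a minimal sketch, not a built-in option of the XML package: the helper name `flatten_list_col` and the toy `Data` frame are made up for illustration, and the choice to map empty entries to NA (rather than "") is an assumption.

```r
# Toy stand-in for the parsed result: one ordinary column and one list column
# whose empty XML nodes came back as character(0).
Data <- data.frame(id = 1:4)
Data$Description_of_location <- list(character(0),
                                     "By swisspark nursing home",
                                     character(0),
                                     character(0))

# Hypothetical helper: turn a list column into a character vector,
# replacing zero-length entries with NA.
flatten_list_col <- function(col) {
  vapply(col, function(v) {
    if (length(v) == 0) NA_character_ else as.character(v[[1]])
  }, character(1), USE.NAMES = FALSE)
}

# Apply the helper to every list column and leave the other columns alone.
is_list_col <- vapply(Data, is.list, logical(1))
Data[is_list_col] <- lapply(Data[is_list_col], flatten_list_col)

str(Data)
```

After this step each column is an atomic vector, so tests such as `is.na(Data$Description_of_location)` work across rows.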

Below is a sample of the XML document.

<data version="1.0">
  <row>
    <start type="JAVA_ROSA_DATETIME">05/11/2014 06:59:48 UTC</start>
    <end type="JAVA_ROSA_DATETIME">05/11/2014 14:16:23 UTC</end>
    <today type="JAVA_ROSA_DATE">05/11/2014</today>
    <deviceid type="STRING"><![CDATA[358870052616368]]></deviceid>
    <subscriberid type="STRING"><![CDATA[404310209661081]]></subscriberid>
    <simid type="STRING"><![CDATA[89913100002096610814]]></simid>
    <phonenumber type="STRING"><![CDATA[918420272664]]></phonenumber>
    <mobilekey type="STRING"><![CDATA[ag9zfmRlbGFndWFtb2JpbGVyFwsSCk1vYmlsZVVuaXQYgICAgMD6-wkM]]></mobilekey>
    <projectkey type="STRING"><![CDATA[ag9zfmRlbGFndWFtb2JpbGVyFAsSB1Byb2plY3QYgICAgKD9hQkM]]></projectkey>
    <recordid type="STRING"><![CDATA[mannaenergy$$05082014141658$$Published&&12]]></recordid>
    <Record_Name type="STRING"/>
    <Watersource_GPS_Cords type="GEOPOINT">22.****:88.****:0.0:26.0</Watersource_GPS_Cords>
    <State_Name type="STRING"><![CDATA[West Bengal]]></State_Name>
    <District_Name type="STRING"><![CDATA[Kolkata]]></District_Name>
    <Block_Name type="STRING"><![CDATA[Bikram]]></Block_Name>
    <Panchayat_Name type="STRING"><![CDATA[Ashok nagar]]></Panchayat_Name>
    <Village_Name type="STRING"><![CDATA[East lake]]></Village_Name>
    <Habitation_Name type="STRING"><![CDATA[Merlin colony]]></Habitation_Name>
    <Unique_water_source_ID type="STRING"><![CDATA[15]]></Unique_water_source_ID>
    <Description_of_location type="STRING"/>
    <Type_of_Water_Source type="STRING"><![CDATA[Public_tap]]></Type_of_Water_Source>
    <Take_a_sample_for_chemical_tes type="STRING"/>
    <Turbidity_TU_input type="STRING"/>
    <Turbidity_FAU_input type="DECIMAL"/>
    <Turbidity_FAU_range type="STRING"/>
    <Warning_turb_FAU type="INTEGER"/>
    <Turbidity_NTU_input type="DECIMAL">0.95</Turbidity_NTU_input>
    <Turbidity_NTU_range type="STRING"><![CDATA[In_range]]></Turbidity_NTU_range>
    <Warning_turb_NTU type="INTEGER"/>

I apologize if this is information overload; I tried to provide everything useful I could think of.

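To summarize: I am trying to parse this XML file so that the zero-length character vectors become blanks or NA elements in the data frame. Bonus points if the type of each XML field can be reflected in the structure of the corresponding data frame column. I hope this is clear.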

Thank you very much for your help.

1 Answer:

Answer 0 (score: 1)

After looking at

How to transform XML data into a data.frame?

I found that setting useInternalNodes = TRUE does exactly what I wanted.

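# xml.url is the web service URL from the question; library(XML) must be loaded.
# Parse into internal (C-level) nodes instead of R-level list nodes.
xml.file <- xmlTreeParse(xml.url, useInternalNodes = TRUE)
xmltop <- xmlRoot(xml.file)
# One row per <row> node, one value per child element.
Data <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
# Transpose so records are rows, and keep strings as character, not factor.
Data <- data.frame(t(Data), row.names = NULL, stringsAsFactors = FALSE)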

This works very well.

  str(Data[,1])
 chr [1:4] "05/11/2014 06:59:48 UTC" "05/11/2014 06:45:59 UTC" "05/11/2014 06:26:16 UTC" "05/11/2014 06:52:42 UTC"
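
The question also asked for NA values and per-column types. A possible follow-up step in base R is sketched below. It assumes that empty nodes arrive as "" once the columns are plain character vectors, that the timestamps are month/day/year in UTC, and it uses a small made-up `Data` frame in place of the real parse result.

```r
# Toy stand-in for the character-only data frame produced above.
Data <- data.frame(
  start = c("05/11/2014 06:59:48 UTC", "05/11/2014 06:45:59 UTC"),
  Description_of_location = c("", "By swisspark nursing home"),
  Free_chlorine_input = c("2.5", ""),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing values.
Data[Data == ""] <- NA

# Let R guess a numeric/integer/logical type where the values allow it;
# as.is = TRUE keeps non-numeric columns as character instead of factor.
Data$Free_chlorine_input <- type.convert(Data$Free_chlorine_input, as.is = TRUE)

# Parse the timestamp column; the month/day order is an assumption.
Data$start <- as.POSIXct(Data$start,
                         format = "%m/%d/%Y %H:%M:%S UTC",
                         tz = "UTC")

str(Data)
```

Converting columns one at a time keeps control over which fields are coerced; with 300+ columns, `Data[] <- lapply(Data, type.convert, as.is = TRUE)` would apply the same guess to every column at once.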