Parsing an XML tree in R with xmlTreeParse

Date: 2018-10-09 00:01:24

Tags: r xml

I am new to R, and I am trying to read and load an XML document from a URL (http://www.cs.washington.edu/research/xmldatasets/data/auctions/ebay.xml) using the XML::xmlTreeParse() function, as shown below:

# load necessary packages ---
library(XML)
library(RCurl)

# load necessary data ----
u <- "http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/ebay.xml"

# convert XML file to an R structure representing the XML/HTML tree
xml.file <- xmlTreeParse(getURL(u), useInternalNodes = TRUE)

But I get this error message:

Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing

Can anyone help me?

2 answers:

Answer 0: (score: 1)

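Since the content does not change, hitting that URL over and over is generally bad form. The file is small, but bandwidth and CPU time are not free for anyone. The network pull may also be your problem: the in-memory download appears to have been only partial.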
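To check whether a partial download is the culprit, it can help to parse the downloaded text explicitly as text and look at what actually arrived when parsing fails. The following is a minimal sketch, not a definitive fix: `try_parse_xml` is a hypothetical helper (it is not part of the XML package), and the two strings are stand-ins for a complete and a truncated response.

```r
library(XML)

# Hypothetical helper (not part of the XML package): try to parse a character
# string as XML; on failure, show the start of the text and return NULL.
try_parse_xml <- function(txt) {
  tryCatch(
    xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE),
    error = function(e) {
      message("Parse failed; first 200 characters received:\n",
              substr(txt, 1, 200))
      NULL
    }
  )
}

# Stand-ins for a complete and a truncated download (made up for illustration)
complete  <- "<root><listing><seller_name>ct-inc</seller_name></listing></root>"
truncated <- substr(complete, 1, 30)

doc_ok  <- try_parse_xml(complete)   # parsed document
doc_bad <- try_parse_xml(truncated)  # NULL, with a message showing the text
```

In the question's setting, the string returned by getURL(u) would be passed to the helper instead of the stand-ins. If the printed text turns out to be an HTML page or a cut-off document, the problem is in the download rather than in xmlTreeParse().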

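We can use httr to avoid the Windows issues with download.file() and to get a simple form of caching for free: by default, write_disk() refuses to overwrite an existing local file, so the file is not downloaded more than once.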

library(httr)
library(XML)
library(xml2)

xml_url <- "https://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/ebay.xml.gz"

Stay organized by storing things in a sensibly named local location:

dir.create("~/Data/xmldata/auctions", recursive = TRUE, showWarnings = FALSE)

xml_fil <- file.path("~/Data/xmldata/auctions", basename(xml_url))

Fetch the file (using the gz version out of consideration for bandwidth, and knowing that both the XML and xml2 packages can read it):

httr::GET(url = xml_url, httr::write_disk(xml_fil))
## Response [https://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/ebay.xml.gz]
##   Date: 2018-10-09 08:47
##   Status: 200
##   Content-Type: application/x-gzip
##   Size: 11 kB
## <ON DISK>  /Users/bob/data/xmldata/auctions/ebay.xml.gz
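The download-once idea above can also be made explicit with a file.exists() check, so that re-running the script skips the request instead of stopping on an existing file. This is a sketch under assumptions: `fetch_once` and `fake_fetch` are hypothetical names, and the `fetch` argument is injectable only so the logic can be tried without network access. Its default uses httr::GET() with httr::write_disk(), as in this answer.

```r
# Hypothetical helper: download `url` to `path` only if `path` does not exist.
# The default `fetch` uses httr::GET() with httr::write_disk() (requires httr).
fetch_once <- function(url, path,
                       fetch = function(url, path) {
                         httr::GET(url, httr::write_disk(path))
                       }) {
  if (file.exists(path)) {
    return(invisible(FALSE))        # already on disk, nothing downloaded
  }
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  fetch(url, path)
  invisible(TRUE)                   # a download was performed
}

# Offline demonstration with a fake fetcher that just writes a tiny file
fake_fetch <- function(url, path) writeLines("<root/>", path)

tmp <- file.path(tempdir(), "fetch_once_demo", "ebay.xml")
unlink(dirname(tmp), recursive = TRUE)   # start from a clean slate

first  <- fetch_once("http://example.invalid/ebay.xml", tmp, fetch = fake_fetch)
second <- fetch_once("http://example.invalid/ebay.xml", tmp, fetch = fake_fetch)
```

Here `first` is TRUE because the file was created, and `second` is FALSE because it was already present. With the real data, the call would be fetch_once(xml_url, xml_fil) with the default fetcher.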

Using XML:

using_XML <- XML::xmlTreeParse(xml_fil, useInternalNodes = TRUE)

using_XML
## <?xml version="1.0"?>
## <!DOCTYPE root SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/ebay.dtd">
## <root>
##   <listing>
##     <seller_info>
##       <seller_name> cubsfantony</seller_name>
##       <seller_rating> 848</seller_rating>
## ... goes on ...
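Once the document is parsed with useInternalNodes = TRUE, values can be pulled out with XPath. The sketch below uses a small inline sample that mirrors the element names in the output above rather than the real file. The second seller_rating value is invented for illustration, and the variable names are assumptions.

```r
library(XML)

# Inline stand-in for the auction data; the rating 403 is made up
sample_xml <- "<root>
  <listing><seller_info>
    <seller_name> cubsfantony</seller_name>
    <seller_rating> 848</seller_rating>
  </seller_info></listing>
  <listing><seller_info>
    <seller_name> ct-inc</seller_name>
    <seller_rating> 403</seller_rating>
  </seller_info></listing>
</root>"

doc <- xmlParse(sample_xml, asText = TRUE)

# xpathSApply() applies xmlValue() to every node matched by the XPath query
seller_names <- trimws(
  xpathSApply(doc, "//seller_info/seller_name", xmlValue)
)
seller_ratings <- as.integer(trimws(
  xpathSApply(doc, "//seller_info/seller_rating", xmlValue)
))
```

The trimws() calls are there because the values in this data set carry leading or trailing spaces, as the printed output shows.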

Using xml2:

using_xml2 <- xml2::read_xml(xml_fil)

using_xml2
## {xml_document}
## <root>
## [1] <listing>\n  <seller_info>\n    <seller_name> cubsfantony</seller_na ...
## [2] <listing>\n  <seller_info>\n    <seller_name> ct-inc</seller_name>\n ...
## [3] <listing>\n  <seller_info>\n    <seller_name> ct-inc</seller_name>\n ...
## [4] <listing>\n  <seller_info>\n    <seller_name>bestbuys4systems </sell ...
## [5] <listing>\n  <seller_info>\n    <seller_name> sales@ctgcom.com</sell ...
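With xml2, the same kind of extraction can be done per listing and collected into a data frame. Again a sketch on an inline sample, not the real file: the second rating is invented, and `sellers` is just an illustrative name.

```r
library(xml2)

# Inline stand-in for the auction data; the rating 403 is made up
sample_xml <- "<root>
  <listing><seller_info>
    <seller_name> cubsfantony</seller_name>
    <seller_rating> 848</seller_rating>
  </seller_info></listing>
  <listing><seller_info>
    <seller_name> ct-inc</seller_name>
    <seller_rating> 403</seller_rating>
  </seller_info></listing>
</root>"

doc <- read_xml(sample_xml)

# One node per <listing>; xml_find_first() then returns one match per listing
listings <- xml_find_all(doc, "//listing")

sellers <- data.frame(
  name   = xml_text(xml_find_first(listings, ".//seller_name"), trim = TRUE),
  rating = as.integer(
    xml_text(xml_find_first(listings, ".//seller_rating"), trim = TRUE)
  ),
  stringsAsFactors = FALSE
)
```

Working listing by listing keeps names and ratings aligned even if some listing lacks one of the fields, since xml_find_first() yields a missing value for that listing instead of silently shortening the vector.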

Answer 1: (score: 0)

Overview

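After loading the XML and RCurl packages, I was able to run your code without any error message. The difference may be that we are using different versions of each package, so I have included my session info at the bottom.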

Code

# load necessary packages ---
library(XML)   # XML_3.98-1.16 
library(RCurl) # RCurl_1.95-4.11

# load necessary data ----
u <- "http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/ebay.xml"

# convert XML file to an R structure representing the XML/HTML tree
xml.file <- xmlTreeParse(getURL(u), useInternalNodes = TRUE)

# check class of xml.file
class(xml.file) # [1] "XMLInternalDocument" "XMLAbstractDocument"

# end of script #

Session info

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] RCurl_1.95-4.11 bitops_1.0-6    XML_3.98-1.16  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19      pillar_1.3.0      compiler_3.5.1   
 [4] plyr_1.8.4        bindr_0.1.1       viridis_0.5.1    
 [7] tools_3.5.1       digest_0.6.17     evaluate_0.11    
[10] tibble_1.4.2      gtable_0.2.0      viridisLite_0.3.0
[13] pkgconfig_2.0.2   rlang_0.2.2       rstudioapi_0.8   
[16] yaml_2.2.0        bindrcpp_0.2.2    gridExtra_2.3    
[19] stringr_1.3.1     dplyr_0.7.6       knitr_1.20       
[22] rprojroot_1.3-2   grid_3.5.1        tidyselect_0.2.4 
[25] glue_1.3.0        R6_2.2.2          rmarkdown_1.10   
[28] ggplot2_3.0.0     purrr_0.2.5       magrittr_1.5     
[31] backports_1.1.2   scales_1.0.0      htmltools_0.3.6  
[34] assertthat_0.2.0  colorspace_1.3-2  stringi_1.2.4    
[37] lazyeval_0.2.1    munsell_0.5.0     crayon_1.3.4
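
To compare just the relevant package versions against the ones listed above, without reading a full session dump, something like the following may be enough. It is a small sketch: `installed_version` is a hypothetical helper name, and it returns NA for packages that are not installed.

```r
# Hypothetical helper: version string of an installed package, or NA if absent
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

pkgs <- c("XML", "RCurl")
versions <- vapply(pkgs, installed_version, character(1))
print(versions)
```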