I'm new to R, and I'm trying to read and load an XML document from a URL (http://www.cs.washington.edu/research/xmldatasets/data/auctions/ebay.xml) using the XML::xmlTreeParse() function, as shown below:
# load necessary packages ---
library(XML)
library(RCurl)
# load necessary data ----
u <- "http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/ebay.xml"
# convert XML file to an R structure representing the XML/HTML tree
xml.file <- xmlTreeParse(getURL(u), useInternalNodes = TRUE)
But I get this error message:
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing
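Can anyone help me?
Answer 0 (score: 1)
Since the content doesn't change, hitting that URL over and over is generally bad form. The file is small, but bandwidth and CPU time aren't free for anyone. That network pull may also be your problem (the in-memory download appears to have been only partial).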
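One way to check whether the in-memory download is really the culprit is to look at what getURL() actually returned before handing it to the parser. The following is only a sketch: looks_like_html() and inspect_download() are hypothetical helper names introduced here, and the idea that the server answered with an HTML page instead of the XML is an assumption, not something confirmed in the question.

```r
# Hypothetical helper: TRUE when the downloaded text starts like an HTML page
# rather than the expected XML document.
looks_like_html <- function(txt) {
  grepl("^\\s*(<!DOCTYPE\\s+html|<html)", txt, ignore.case = TRUE)
}

# Hypothetical helper: download the URL into memory, report how much text
# arrived, print the first 200 characters, and flag an HTML-looking payload.
inspect_download <- function(u) {
  txt <- RCurl::getURL(u)
  cat("characters received:", nchar(txt), "\n")
  cat(substr(txt, 1, 200), "\n")
  looks_like_html(txt)
}

# Usage (needs network access):
# inspect_download("http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/ebay.xml")
```

If this returns TRUE, the parser was probably given a redirect or error page rather than the data. In that case one thing to try is getURL(u, followlocation = TRUE), which asks libcurl to follow redirects.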
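We can use httr to avoid the Windows issues with download.file() and to get a simple form of built-in caching (by default, write_disk() will not overwrite an existing local file, so the file is not downloaded more than once):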
library(httr)
library(XML)
library(xml2)
xml_url <- "https://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/ebay.xml.gz"
Keep things organized by storing them in a sensibly named local location:
dir.create("~/Data/xmldata/auctions", recursive = TRUE)
xml_fil <- file.path("~/Data/xmldata/auctions", basename(xml_url))
Fetch the file (using the gz version out of consideration for bandwidth, and knowing that both the XML and xml2 packages can read gzipped files):
httr::GET(url = xml_url, httr::write_disk(xml_fil))
## Response [https://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/ebay.xml.gz]
## Date: 2018-10-09 08:47
## Status: 200
## Content-Type: application/x-gzip
## Size: 11 kB
## <ON DISK> /Users/bob/data/xmldata/auctions/ebay.xml.gz
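If the script is re-run, write_disk() with its default overwrite = FALSE stops with an error when the target file already exists, so it can be convenient to skip the request explicitly. A minimal sketch of that idea, where fetch_once() is a hypothetical helper (not part of httr) and the fetch argument exists only so the logic can be tried without network access:

```r
# Hypothetical helper (not part of httr): download `url` to `dest` only when
# `dest` is not already on disk. `fetch` defaults to the httr call used above
# and can be swapped out for testing.
fetch_once <- function(url, dest,
                       fetch = function(u, d) httr::GET(url = u, httr::write_disk(d))) {
  if (file.exists(dest)) {
    message("Using cached copy: ", dest)
    return(invisible(dest))
  }
  # Create the target directory if needed, without warning when it exists
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  fetch(url, dest)
  invisible(dest)
}

# Usage with the names defined above (needs network access on the first run):
# fetch_once(xml_url, xml_fil)
```

The second and later calls return immediately with the cached path, so the server is only contacted once.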
Using XML:
using_XML <- XML::xmlTreeParse(xml_fil, useInternalNodes = TRUE)
using_XML
## <?xml version="1.0"?>
## <!DOCTYPE root SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/ebay.dtd">
## <root>
## <listing>
## <seller_info>
## <seller_name> cubsfantony</seller_name>
## <seller_rating> 848</seller_rating>
## ... goes on ...
Using xml2:
using_xml2 <- xml2::read_xml(xml_fil)
using_xml2
## {xml_document}
## <root>
## [1] <listing>\n <seller_info>\n <seller_name> cubsfantony</seller_na ...
## [2] <listing>\n <seller_info>\n <seller_name> ct-inc</seller_name>\n ...
## [3] <listing>\n <seller_info>\n <seller_name> ct-inc</seller_name>\n ...
## [4] <listing>\n <seller_info>\n <seller_name>bestbuys4systems </sell ...
## [5] <listing>\n <seller_info>\n <seller_name> sales@ctgcom.com</sell ...
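Once the document is parsed, values can be pulled out with XPath. The sketch below uses a tiny inline document that copies only the element names and the two seller names visible in the output above, so it runs without the download; the XPath expression is an assumption based on that visible structure.

```r
library(xml2)

# Small stand-in document mirroring the structure shown above
doc <- read_xml("<root>
  <listing><seller_info><seller_name> cubsfantony</seller_name><seller_rating> 848</seller_rating></seller_info></listing>
  <listing><seller_info><seller_name> ct-inc</seller_name></seller_info></listing>
</root>")

# Find every seller_name node and return its text with surrounding
# whitespace trimmed
sellers <- xml_text(xml_find_all(doc, ".//seller_info/seller_name"), trim = TRUE)
sellers
```

Replacing doc with using_xml2 should apply the same query to the full file.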
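Answer 1 (score: 0)
After loading the XML and RCurl packages, I was able to run your code without any error message. It may be that we are using different versions of each package, so I've included my session info at the bottom.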
# load necessary packages ---
library(XML) # XML_3.98-1.16
library(RCurl) # RCurl_1.95-4.11
# load necessary data ----
u <- "http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/ebay.xml"
# convert XML file to an R structure representing the XML/HTML tree
xml.file <- xmlTreeParse(getURL(u), useInternalNodes = TRUE)
# check class of xml.file
class(xml.file) # [1] "XMLInternalDocument" "XMLAbstractDocument"
# end of script #
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] RCurl_1.95-4.11 bitops_1.0-6 XML_3.98-1.16
loaded via a namespace (and not attached):
[1] Rcpp_0.12.19 pillar_1.3.0 compiler_3.5.1
[4] plyr_1.8.4 bindr_0.1.1 viridis_0.5.1
[7] tools_3.5.1 digest_0.6.17 evaluate_0.11
[10] tibble_1.4.2 gtable_0.2.0 viridisLite_0.3.0
[13] pkgconfig_2.0.2 rlang_0.2.2 rstudioapi_0.8
[16] yaml_2.2.0 bindrcpp_0.2.2 gridExtra_2.3
[19] stringr_1.3.1 dplyr_0.7.6 knitr_1.20
[22] rprojroot_1.3-2 grid_3.5.1 tidyselect_0.2.4
[25] glue_1.3.0 R6_2.2.2 rmarkdown_1.10
[28] ggplot2_3.0.0 purrr_0.2.5 magrittr_1.5
[31] backports_1.1.2 scales_1.0.0 htmltools_0.3.6
[34] assertthat_0.2.0 colorspace_1.3-2 stringi_1.2.4
[37] lazyeval_0.2.1 munsell_0.5.0 crayon_1.3.4
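To compare just the two relevant package versions without reading through a full session info dump, something like the following could be run on both machines. It assumes XML and RCurl are installed; pkgs and versions are names chosen here for illustration.

```r
# Packages whose versions we want to compare between machines
pkgs <- c("XML", "RCurl")

# packageVersion() returns a version object; convert to character so the
# result is a plain named vector
versions <- vapply(pkgs, function(p) as.character(utils::packageVersion(p)), character(1))
print(versions)
```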