R xmlParse / xmlTreeParse: Unknown IO error

Date: 2017-04-05 12:13:53

Tags: r xml

I am trying to reproduce the commands from this Stack Overflow question using the XML package.

> library(XML)
> library(RCurl)

> nct_url <- "http://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true"
> xml_doc <- xmlParse(nct_url, useInternalNodes=TRUE)
Unknown IO errorfailed to load external entity "http://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true"
Error: 1: Unknown IO error2: failed to load external entity "http://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true"

> doc <- xmlTreeParse(getURL(nct_url), useInternalNodes=TRUE)
Error: XML content does not seem to be XML: ''
> getURL(nct_url)
[1] ""

The link in nct_url is valid and points to an XML file. Any idea what is going wrong here?

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE 13.2 (Harlequin) (x86_64)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.95-4.8 bitops_1.0-6   XML_3.98-1.4  

1 answer:

Answer 0 (score: 1)

It seems to work for me using xml2 (note that the URL below uses https rather than http):

library(xml2)
library(tidyverse)

doc <- read_xml("https://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true")

doc
## {xml_document}
## <clinical_study>
##  [1] <required_header>\n  <download_date>ClinicalTrials.gov processed th ...
##  [2] <id_info>\n  <org_study_id>ARG-CS3-001</org_study_id>\n  <nct_id>NC ...
##  [3] <brief_title>A Study of the Safety and Efficacy of Nitric Oxide Red ...
##  [4] <official_title>A Phase III International, Multi-Center, Prospectiv ...
##  [5] <sponsors>\n  <lead_sponsor>\n    <agency>Arginox Pharmaceuticals</ ...
##  [6] <source>Arginox Pharmaceuticals</source>
##  [7] <brief_summary>\n  <textblock>\n      Tilarginine Acetate Injection ...
##  [8] <detailed_description>\n  <textblock>\n      An estimated 120,000 t ...
##  [9] <overall_status>Terminated</overall_status>
## [10] <start_date>May 2005</start_date>
## [11] <completion_date>January 2007</completion_date>
## [12] <phase>Phase 3</phase>
## [13] <study_type>Interventional</study_type>
## [14] <study_design_info>\n  <allocation>Randomized</allocation>\n  <inte ...
## [15] <primary_outcome>\n  <measure>All cause mortality at 30 days post r ...
## [16] <secondary_outcome>\n  <measure>Number of patients demonstrating re ...
## [17] <secondary_outcome>\n  <measure>The duration of cardiogenic shock c ...
## [18] <enrollment>658</enrollment>
## [19] <condition>Shock, Cardiogenic</condition>
## [20] <intervention>\n  <intervention_type>Drug</intervention_type>\n  <i ...
## ...

xml_find_all(doc, ".//location") %>%
  map(xml_children) %>%
  map(xml_find_all, ".//*") %>%
  map_df(~as.list(set_names(xml_text(.), xml_name(.)))) %>%
  select(-address) %>%
  glimpse()
## Observations: 102
## Variables: 5
## $ name    <chr> "The Heart Group, PC", "Sparks Regional Medical Center...
## $ city    <chr> "Mobile", "Fort Smith", "Mesa", "Phoenix", "Little Roc...
## $ state   <chr> "Alabama", "Arizona", "Arizona", "Arizona", "Arkansas"...
## $ zip     <chr> "36608", "72901", "85206", "85043", "72205", "90017", ...
## $ country <chr> "United States", "United States", "United States", "Un...
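
For anyone who wants to stay with the XML and RCurl packages from the question, here is a minimal sketch. The thread does not confirm the cause, so this rests on an assumption: the http:// address redirects to https://, which getURL() does not follow by default (hence the empty string) and which xmlParse() cannot fetch by itself. The helper name fetch_xml is hypothetical and not part of either package.

```r
library(XML)
library(RCurl)

# Hypothetical helper (not from the thread): download the document as text,
# following HTTP redirects, then parse the string rather than the URL.
fetch_xml <- function(url) {
  # followlocation = TRUE tells libcurl to follow 3xx redirects
  txt <- getURL(url, followlocation = TRUE)
  if (!nzchar(txt)) {
    stop("empty response from ", url)
  }
  # asText = TRUE makes xmlParse treat its argument as XML content
  xmlParse(txt, asText = TRUE)
}

# Usage (requires network access; behaviour of the remote site is assumed):
# doc <- fetch_xml("https://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true")
# xmlName(xmlRoot(doc))
```

Parsing the downloaded text keeps the network step in RCurl, so an empty or failed response is reported before the parser sees it.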