I am trying to reproduce the XML package commands from this Stack Overflow question.
> library(XML)
> library(RCurl)
> nct_url <- "http://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true"
> xml_doc <- xmlParse(nct_url, useInternalNodes=TRUE)
Unknown IO errorfailed to load external entity "http://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true"
Error: 1: Unknown IO error2: failed to load external entity "http://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true"
> doc <- xmlTreeParse(getURL(nct_url), useInternalNodes=TRUE)
Error: XML content does not seem to be XML: ''
> getURL(nct_url)
[1] ""
The link in nct_url is valid and points to an XML file. Any idea what is going wrong here?
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE 13.2 (Harlequin) (x86_64)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.95-4.8 bitops_1.0-6 XML_3.98-1.4
Answer 0 (score: 1)
It works for me with xml2, requesting the URL over https rather than http:
library(xml2)
library(tidyverse)
doc <- read_xml("https://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true")
doc
## {xml_document}
## <clinical_study>
## [1] <required_header>\n <download_date>ClinicalTrials.gov processed th ...
## [2] <id_info>\n <org_study_id>ARG-CS3-001</org_study_id>\n <nct_id>NC ...
## [3] <brief_title>A Study of the Safety and Efficacy of Nitric Oxide Red ...
## [4] <official_title>A Phase III International, Multi-Center, Prospectiv ...
## [5] <sponsors>\n <lead_sponsor>\n <agency>Arginox Pharmaceuticals</ ...
## [6] <source>Arginox Pharmaceuticals</source>
## [7] <brief_summary>\n <textblock>\n Tilarginine Acetate Injection ...
## [8] <detailed_description>\n <textblock>\n An estimated 120,000 t ...
## [9] <overall_status>Terminated</overall_status>
## [10] <start_date>May 2005</start_date>
## [11] <completion_date>January 2007</completion_date>
## [12] <phase>Phase 3</phase>
## [13] <study_type>Interventional</study_type>
## [14] <study_design_info>\n <allocation>Randomized</allocation>\n <inte ...
## [15] <primary_outcome>\n <measure>All cause mortality at 30 days post r ...
## [16] <secondary_outcome>\n <measure>Number of patients demonstrating re ...
## [17] <secondary_outcome>\n <measure>The duration of cardiogenic shock c ...
## [18] <enrollment>658</enrollment>
## [19] <condition>Shock, Cardiogenic</condition>
## [20] <intervention>\n <intervention_type>Drug</intervention_type>\n <i ...
## ...
xml_find_all(doc, ".//location") %>%
map(xml_children) %>%
map(xml_find_all, ".//*") %>%
map_df(~as.list(set_names(xml_text(.), xml_name(.)))) %>%
select(-address) %>%
glimpse()
## Observations: 102
## Variables: 5
## $ name <chr> "The Heart Group, PC", "Sparks Regional Medical Center...
## $ city <chr> "Mobile", "Fort Smith", "Mesa", "Phoenix", "Little Roc...
## $ state <chr> "Alabama", "Arizona", "Arizona", "Arizona", "Arkansas"...
## $ zip <chr> "36608", "72901", "85206", "85043", "72205", "90017", ...
## $ country <chr> "United States", "United States", "United States", "Un...
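For anyone who wants to try the location-flattening step without hitting the network, here is a minimal offline sketch. The inline XML is a hand-written, hypothetical sample that only assumes the location/facility/address nesting implied by the output above; the site names and the get_field helper are made up for illustration and are not part of xml2. It uses explicit XPath lookups rather than the purrr pipeline, so treat it as one possible variant, not the method from the answer.

```r
library(xml2)

# Hypothetical sample mimicking the assumed <location> layout; not real trial data.
sample_xml <- '<clinical_study>
  <location>
    <facility>
      <name>Site A</name>
      <address>
        <city>Mobile</city>
        <state>Alabama</state>
        <zip>36608</zip>
        <country>United States</country>
      </address>
    </facility>
  </location>
  <location>
    <facility>
      <name>Site B</name>
      <address>
        <city>Mesa</city>
        <state>Arizona</state>
        <country>United States</country>
      </address>
    </facility>
  </location>
</clinical_study>'

doc <- read_xml(sample_xml)
locations <- xml_find_all(doc, ".//location")

# Hypothetical helper: one value per location. xml_find_first() keeps the
# result the same length as its input, so a missing field becomes NA.
get_field <- function(xpath) xml_text(xml_find_first(locations, xpath))

sites <- data.frame(
  name    = get_field("./facility/name"),
  city    = get_field("./facility/address/city"),
  state   = get_field("./facility/address/state"),
  zip     = get_field("./facility/address/zip"),
  country = get_field("./facility/address/country"),
  stringsAsFactors = FALSE
)

print(sites)
```

Because each column comes from its own XPath query, a location that lacks a field (Site B has no zip here) yields NA in that column and the rows stay aligned.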