I can't parse this news content

Asked: 2017-03-06 10:13:24

Tags: r xml

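I don't know why this error occurs. I'm trying to parse the news content into fields such as title, link, description, and date, and save them in a data frame using the xmlParse function, but it throws errors like these: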

site = "http://www.federalreserve.gov/feeds/prates.xml"
doc <- tryCatch(xmlParse(site),  error=function(e) e);      
Unknown IO errorfailed to load external entity    
"http://www.federalreserve.gov/feeds/prates.xml"
src <- xpathApply(xmlRoot(doc), "//item") 
Error in UseMethod("xmlRoot") :
  no applicable method for 'xmlRoot' applied to an object of class "c('XMLParserErrorList', 'simpleError', 'error', 'condition')"
for (i in 1:length(src)) {
  if (i == 1) {
    foo <- xmlSApply(src[[i]], xmlValue)
    temp <- data.frame(t(foo), stringsAsFactors = FALSE)
    DATA <- data.frame(title = temp$title, link = temp$link, description = temp$description, pubDate = temp$pubDate)
  } else {
    foo <- xmlSApply(src[[i]], xmlValue)
    temp <- data.frame(t(foo), stringsAsFactors = FALSE)
    temp1 <- data.frame(title = temp$title, link = temp$link, description = temp$description, pubDate = temp$pubDate)
    DATA <- rbind(DATA, temp1)
  }
}
 Error: object 'src' not found

1 Answer:

Answer 0 (score: 0)

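The error means the URL redirects to HTTPS, as mentioned in my comment. The response headers show the 301 redirect, and xmlParse fails on the original HTTP address: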

site         <- "http://www.federalreserve.gov/feeds/prates.xml"
correct_site <- "https://www.federalreserve.gov/feeds/prates.xml"

curlGetHeaders(site)
 [1] "HTTP/1.1 301 Moved Permanently\r\n"                                                                                                        
 [2] "Location: https://www.federalreserve.gov/feeds/prates.xml\r\n"                                                                             
 ...    

xmlParse(site)
Unknown IO errorfailed to load external entity "http://www.federalreserve.gov/feeds/prates.xml"

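xmlParse cannot read from HTTPS, so fetch the content over secure HTTP some other way: readLines (the warning can be ignored), the xml2 package, or one of many other options.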

xmlParse( correct_site)
Error: XML content does not seem to be XML: 'https://www.federalreserve.gov/feeds/prates.xml'

x <- readLines(correct_site)
Warning message:
In readLines(correct_site) :
  incomplete final line found on 'https://www.federalreserve.gov/feeds/prates.xml'


xmlParse(x)
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:cb="http://www.cbwiki.net/wiki/index.php/Specification_1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1999/02/22-rdf-syntax-ns# rdf.xsd">
  <channel rdf:about="http://www.federalreserve.gov/feeds/">
    <title>FRB: DDP: Policy Rates</title>
...

library(xml2)
read_xml( correct_site)

{xml_document}
<RDF schemaLocation="http://www.w3.org/1999/02/22-rdf-syntax-ns# rdf.xsd" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:cb="http://www.cbwiki.net/wiki/index.php/Specification_1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
[1] <channel rdf:about="http://www.federalreserve.gov/feeds/">\n  <title>FRB: DDP: Policy Rates</title>\n   ...
[2] <item rdf:about="http://www.federalreserve.gov/feeds/PRATES.html#1765">\n  <title>Change to the Publica ...
[3] <item rdf:about="http://www.federalreserve.gov/feeds/PRATES.html#953">\n  <title>Change to the Payment  .
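
To get from the parsed document to the data frame the question asks for, something like the following may work. This is a minimal sketch, not a tested solution for the live feed: the inline `feed` string is a made-up stand-in that only mimics the RSS 1.0 namespaces shown above, and the `dc:date` element is an assumption (RSS 1.0 feeds commonly use it instead of `pubDate`; check which elements the real items contain). The helper name `field` is hypothetical.

```r
library(xml2)

# Hypothetical stand-in for the feed, using the same default RSS 1.0 namespace.
# For the live feed, replace this with: doc <- read_xml(correct_site)
feed <- '<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://example.com/#1">
    <title>First item</title>
    <link>http://example.com/1</link>
    <description>First description</description>
    <dc:date>2017-03-01T00:00:00Z</dc:date>
  </item>
  <item rdf:about="http://example.com/#2">
    <title>Second item</title>
    <link>http://example.com/2</link>
    <description>Second description</description>
  </item>
</rdf:RDF>'

doc <- read_xml(feed)

# The items live in a default namespace, which xml2 exposes as "d1";
# rename it to "rss" so the XPath expressions are easier to read.
ns <- xml_ns_rename(xml_ns(doc), d1 = "rss")

items <- xml_find_all(doc, "//rss:item", ns)

# Extract one child element per item; items lacking it yield NA.
field <- function(xpath) xml_text(xml_find_first(items, xpath, ns))

DATA <- data.frame(
  title       = field("./rss:title"),
  link        = field("./rss:link"),
  description = field("./rss:description"),
  date        = field("./dc:date"),   # assumed element name, see note above
  stringsAsFactors = FALSE
)

print(DATA)
```

Because `xml_find_first` returns one result per item, the columns are built in a single vectorised step and the `for`/`rbind` loop from the question is not needed.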