I am trying to access XML data from the following URL:
forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML
When I open it in a browser, I can see the full data:
<dwml xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.0" xsi:noNamespaceSchemaLocation="http://graphical.weather.gov/xml/DWMLgen/schema/DWML.xsd">
<head>
<product concise-name="tabular-digital" operational-mode="developmental" srsName="WGS 1984">
<creation-date refresh-frequency="PT1H">2015-07-09T07:15:40-04:00</creation-date>
</product>
<source>
<production-center>Jacksonville, FL</production-center>
<credit>http://www.srh.noaa.gov/jax</credit>
<more-information>http://www.nws.noaa.gov/forecasts/xml/</more-information>
</source>
</head>
<data>
...
However, when I try to download it into R, I get the following error:
require(XML)
testxml <- xmlParse("forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")
Error: XML content does not seem to be XML: 'forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML'
Using RCurl:
require(RCurl)
testurl <- getURL("forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")
testurl
"<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don't have permission to access \"http://forecast.weather.gov/MapClick.php?\" on this server.<P>\nReference #18.34e722cf.1436445794.3ebde80\n</BODY>\n</HTML>\n"
I assume these two problems are related. RCurl works fine with the other websites I have tried. I would like to understand why this happens.
Answer 0 (score: 2)
By setting the User-Agent field in the HTTP header, I was able to get the XML data from weather.gov with RCurl. For example:
require(RCurl)
testurl <- getURL("forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML", httpheader = c("User-Agent"="Mozilla/5.0 (Windows NT 6.1; WOW64)"))
testurl will then contain the same XML that is returned when you enter the URL in a web browser.
If you need to experiment, a list of user-agent strings for various browsers is available at http://www.useragentstring.com/pages/Browserlist/.
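To get from the downloaded string to a parsed document, something like the following might work. It is only a sketch: fetch_dwml and is_access_denied are hypothetical helper names, not part of RCurl or XML, and it assumes the server keeps accepting a browser-like User-Agent. It also passes asText = TRUE so that xmlParse treats its argument as XML content. The first error in the question is probably a separate issue: the string had no http:// prefix, so xmlParse appears to have tried to read the address itself as XML.

```r
# Sketch only: fetch_dwml() and is_access_denied() are hypothetical helper names.
# Requires the RCurl and XML packages used elsewhere in this thread.

# TRUE when the body looks like the "Access Denied" HTML page from the question.
is_access_denied <- function(body) {
  grepl("<TITLE>Access Denied</TITLE>", body, fixed = TRUE)
}

fetch_dwml <- function(url,
                       user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64)") {
  # Send a browser-like User-Agent header with the request.
  body <- RCurl::getURL(url, httpheader = c("User-Agent" = user_agent))

  # Fail early instead of handing an HTML error page to the XML parser.
  if (is_access_denied(body)) {
    stop("Server returned 'Access Denied'; try a different User-Agent string")
  }

  # asText = TRUE: `body` holds the XML itself, not a file name or URL.
  XML::xmlParse(body, asText = TRUE)
}
```

Calling fetch_dwml("forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML") should then return a parsed document, or stop with a message if the server still refuses the request.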