From R

Time: 2015-07-09 12:53:13

Tags: r

I am trying to access XML data from the following URL:

forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML

When I open it in a browser, I can see the full data:

<dwml xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.0" xsi:noNamespaceSchemaLocation="http://graphical.weather.gov/xml/DWMLgen/schema/DWML.xsd">
<head>
<product concise-name="tabular-digital" operational-mode="developmental" srsName="WGS 1984">
<creation-date refresh-frequency="PT1H">2015-07-09T07:15:40-04:00</creation-date>
</product>
<source>
<production-center>Jacksonville, FL</production-center>
<credit>http://www.srh.noaa.gov/jax</credit>
<more-information>http://www.nws.noaa.gov/forecasts/xml/</more-information>
</source>
</head>
<data>
...

However, when I try to download it into R, I get the following error:

require(XML)
testxml <- xmlParse("forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")
Error: XML content does not seem to be XML: 'forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML'

Using RCurl:

require(RCurl)
testurl <- getURL("forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")
testurl
 "<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don't have permission to access \"http&#58;&#47;&#47;forecast&#46;weather&#46;gov&#47;MapClick&#46;php&#63;\" on this server.<P>\nReference&#32;&#35;18&#46;34e722cf&#46;1436445794&#46;3ebde80\n</BODY>\n</HTML>\n"

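I assume these two problems are related. RCurl works fine with the other websites I have tried, so I would like to understand why this happens.

1 answer:

Answer 0 (score: 2)

By setting the User-Agent field in the HTTP header, I was able to get the XML data from weather.gov with RCurl. For example: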

require(RCurl)
testurl <- getURL("forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML", httpheader = c("User-Agent"="Mozilla/5.0 (Windows NT 6.1; WOW64)"))

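testurl will then contain the same XML that is returned when you enter the URL in a web browser.

If you need to experiment, you can find a list of user agent strings for various browsers at http://www.useragentstring.com/pages/Browserlist/.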
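
To get from the downloaded string to a parsed document, the text can be handed to xmlParse with asText = TRUE. The following is only a minimal sketch: fetch_dwml is a hypothetical helper name (not part of RCurl or XML), and the sample string is a hand-trimmed fragment of the XML shown in the question so that the parsing step can be tried without network access. The original xmlParse call most likely failed because the string has no http:// scheme and so was not recognized as a URL, and even with the scheme the server would probably have refused the request without a browser-like User-Agent.

```r
library(RCurl)
library(XML)

# Hypothetical helper (an assumption, not from the original answer):
# download a URL with a browser-like User-Agent and parse the body as XML.
fetch_dwml <- function(url, agent = "Mozilla/5.0 (Windows NT 6.1; WOW64)") {
  body <- getURL(url, httpheader = c("User-Agent" = agent))
  # asText = TRUE tells xmlParse that `body` is XML content,
  # not a file name or URL to be opened.
  xmlParse(body, asText = TRUE)
}

# Offline check of the parsing step, using a trimmed fragment
# of the DWML document shown in the question.
sample_xml <- '<dwml version="1.0"><head><source>
  <production-center>Jacksonville, FL</production-center>
</source></head></dwml>'

doc <- xmlParse(sample_xml, asText = TRUE)

# Pull out the text of every <production-center> element.
center <- xpathSApply(doc, "//production-center", xmlValue)
print(center)
```

With network access, fetch_dwml("http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML") should return a document that can be queried in the same way, assuming the server still accepts that User-Agent string.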