How do I load data from a table on a website into the R environment?

Date: 2016-08-26 21:08:30

Tags: html r xml

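I am trying to use R to download the data at http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2016&MONTH=08&FROM=2612&TO=2612&STNM=71203. As I understand it, this is a list. I have tried the XML package, but I keep getting the error 'Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'readHTMLList' for signature '"NULL"''. I get the same error when I use readHTMLTable(). This is how I have been calling the function: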

url = "http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2016&MONTH=08&FROM=2612&TO=2612&STNM=71203"
mydata = readHTMLTable(url, which = 11, trim = T)

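I have also tried adding header = T, stringsAsFactors = F, and readLines(url) to the function options, to no avail. If I needed only one of these tables I would download it by hand, but I need a large number of them. My plan, once the initial call works, is to loop over the FROM= and TO= values in the URL to reach the sounding data for different dates and times. Any help would be great.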

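2 answers:

Answer 0: (score: 5)

Thankfully, this is a plain-text table wrapped in a <pre> tag, so we can read in the HTML, extract the text from the <pre> tag, and then read that text into a table while supplying proper column names and types: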

library(rvest)
library(readr)

URL <- "http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2016&MONTH=08&FROM=2612&TO=2612&STNM=71203"
# Fetch and parse the page
pg <- read_html(URL)
# Pull the text out of the first <pre> block
html_nodes(pg, "pre")[[1]] %>%
  html_text() -> dat

read_table(dat, skip=5, col_types="ddddddddddd",
           col_names=c("pres", "hght", "temp", "dwpt", "relh", "mixr",
                       "drct", "sknt", "thta", "thte", "thtv")) -> df

dplyr::glimpse(df)

## Variables: 11
## $ pres <dbl> 1000.0, 963.0, 962.0, 955.0, 945.8, 944.0, 925.0, 912.8, 891.0, 880.8, 877.0, 850.0, 819.1...
## $ hght <dbl> 130, 456, 465, 527, 610, 626, 800, 914, 1121, 1219, 1256, 1522, 1829, 2134, 2438, 2743, 31...
## $ temp <dbl> NA, 13.2, 15.2, 18.4, 18.9, 19.0, 18.8, 18.2, 17.2, 17.4, 17.4, 15.0, 12.4, 9.8, 7.2, 4.7,...
## $ dwpt <dbl> NA, 8.8, 9.2, 8.4, 7.2, 7.0, 6.8, 6.2, 5.2, 5.3, 5.4, 4.0, 2.8, 1.6, 0.4, -0.9, -2.3, -2.5...
## $ relh <dbl> NA, 75, 67, 52, 47, 46, 46, 45, 45, 45, 45, 48, 52, 56, 62, 67, 75, 75, 73, 70, 68, 23, 17...
## $ mixr <dbl> NA, 7.43, 7.64, 7.29, 6.79, 6.70, 6.74, 6.57, 6.26, 6.40, 6.45, 6.03, 5.74, 5.46, 5.19, 4....
## $ drct <dbl> NA, 240, 247, 295, 0, 15, 175, 170, 72, 25, 22, 0, 335, 300, 290, 300, 300, 300, 300, 319,...
## $ sknt <dbl> NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 3, 5, 5, 7, 6, 6, 6, 10, 10, 21, 21, 21, 21, 21, 21, ...
## $ thta <dbl> NA, 289.4, 291.6, 295.4, 296.7, 297.0, 298.5, 299.1, 300.1, 301.2, 301.6, 301.9, 302.3, 30...
## $ thte <dbl> NA, 310.7, 313.6, 316.8, 316.9, 316.9, 318.6, 318.8, 319.0, 320.6, 321.2, 320.2, 319.8, 31...
## $ thtv <dbl> NA, 290.8, 292.9, 296.7, 298.0, 298.2, 299.7, 300.3, 301.2, 302.4, 302.8, 302.9, 303.4, 30...
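
The question's plan to loop over FROM= and TO= could start from a list of URLs, one per observation time. The sketch below is only an illustration: `sounding_url` is a hypothetical helper, not part of rvest or readr, and it assumes that FROM and TO encode the day and hour as "DDHH", which is read off the example URL (FROM=2612 for the 26th at 12) rather than taken from any documentation.

```r
# Hypothetical helper (not from the answers above): build the sounding URL
# for one or more observation times. Assumes FROM/TO are day + hour ("DDHH").
sounding_url <- function(time, stnm = 71203L, region = "naconf") {
  ddhh <- format(time, "%d%H", tz = "UTC")
  sprintf(
    paste0("http://weather.uwyo.edu/cgi-bin/sounding?region=%s",
           "&TYPE=TEXT%%3ALIST&YEAR=%s&MONTH=%s&FROM=%s&TO=%s&STNM=%d"),
    region,
    format(time, "%Y", tz = "UTC"),
    format(time, "%m", tz = "UTC"),
    ddhh, ddhh, as.integer(stnm)
  )
}

# Observation times every 12 hours over two days (illustrative range)
times <- seq(as.POSIXct("2016-08-26 00:00", tz = "UTC"),
             as.POSIXct("2016-08-27 12:00", tz = "UTC"),
             by = "12 hours")

# One URL per observation time
urls <- sounding_url(times)
```

Each element of `urls` could then go through the same read_html / html_text / read_table steps inside a loop, ideally with a short pause such as Sys.sleep(1) between requests so the server is not hammered.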

Answer 1: (score: 2)

Using the rvest and readr packages:

> txt = read_html(url) %>% html_node("pre") %>% html_text()

gets the text inside the <pre> tag. Then:

> data = txt %>% read_fwf(fwf_empty(.,skip=5),skip=5)

makes a data frame:

> head(data)
      X1  X2   X3  X4 X5   X6  X7 X8    X9   X10   X11
1 1000.0 130   NA  NA NA   NA  NA NA    NA    NA    NA
2  963.0 456 13.2 8.8 75 7.43 240  1 289.4 310.7 290.8
3  962.0 465 15.2 9.2 67 7.64 247  1 291.6 313.6 292.9
4  955.0 527 18.4 8.4 52 7.29 295  1 295.4 316.8 296.7
5  945.8 610 18.9 7.2 47 6.79   0  1 296.7 316.9 298.0
6  944.0 626 19.0 7.0 46 6.70  15  1 297.0 316.9 298.2

Getting the names is left as an exercise for the reader...
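
One possible approach to that exercise, sketched in base R: take the line that follows the first row of dashes in the <pre> text and split it on whitespace. The sample text below is made up to mimic the layout implied by skip=5 (a blank line, dashes, names, units, dashes); the real page may differ, so the position of the name row is an assumption, as are the variable names `txt_lines`, `dash_rows`, and `nms`.

```r
# Made-up sample mimicking the assumed layout of the <pre> text
txt <- paste(
  "",
  "-----------------------------",
  "   PRES   HGHT   TEMP   DWPT",
  "    hPa     m      C      C ",
  "-----------------------------",
  " 1000.0    130               ",
  "  963.0    456   13.2    8.8",
  sep = "\n"
)

# Split the text into lines
txt_lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]

# Rows consisting only of dashes (allowing trailing whitespace)
dash_rows <- grep("^-+\\s*$", txt_lines)

# Assumption: the column names sit on the line right after the first dash row
header <- txt_lines[dash_rows[1] + 1]

# Split on runs of whitespace and lower-case the result
nms <- tolower(strsplit(trimws(header), "\\s+")[[1]])
```

With the real text in `txt`, and provided the number of names matches the number of columns, the result could be applied with names(data) <- nms.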