I am trying to pull model data from NOAA using readHTMLTable. The table I am after has multiple subheaders, where each subheader consists of a single cell spanning every column, as far as I can tell from the HTML. For some reason this causes readHTMLTable to ignore the row immediately following the subheader. The following code reproduces the problem:
library(XML)

url <- "http://nomads.ncep.noaa.gov/"
ncep.tables <- readHTMLTable(url, header = TRUE)

# Find the table of real-time models
for (ncep.table in ncep.tables) {
  if ("grib filter" %in% names(ncep.table) && "gds-alt" %in% names(ncep.table)) {
    rt.tbl <- ncep.table
  }
}

# Here's where the problem is:
cat(paste(rt.tbl[["Data Set"]][15:20], collapse = "\n"))
#On the website, there is a model called "AQM Daily Maximum"
#between Regional Models and AQM Hourly Surface Ozone
#but it's missing now...
So if you go to http://nomads.ncep.noaa.gov/ and look at the central table ("Data Sets" in the right-hand cell), you will see a subheading called "Regional Models." During extraction with the code above, the AQM Daily Maximum model directly below that subheading is skipped.
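For what it's worth, a colspan subheader by itself does not seem to be the culprit. On a minimal, well-formed toy table (hypothetical data below, not the live NOMADS page), readHTMLTable keeps the row after the subheader:

```r
library(XML)

# Toy, well-formed table with a subheader cell spanning all columns
# (hypothetical data, not the live NOMADS page)
html <- '<table>
  <tr><th>Data Set</th><th>freq</th></tr>
  <tr><td colspan="2">Regional Models</td></tr>
  <tr><td>AQM Daily Maximum</td><td>06Z, 12Z</td></tr>
</table>'
readHTMLTable(htmlParse(html), header = TRUE)[[1]]
# The "AQM Daily Maximum" row survives here, so a colspan subheader
# alone is not what makes readHTMLTable drop rows on the live page.
```

That suggests the skipped rows come from something else in the page's markup rather than from the colspan cells themselves.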
I maintain the rNOMADS package in R, so getting this working would save me a lot of package-maintenance time and keep the package accurate and up to date for its users. Thanks for your help!
Answer 0 (score: 0)
By golly, I think I got it. You can't use readHTMLTable (and I now understand the XML package's code far better than I did before... there is some serious R-fu in there). I'm using rvest only because I mix XPath and CSS selectors (I end up thinking in XPath more); dplyr is only there for glimpse.
Please do verify that the columns match up. I eyeballed them, but verification would be great. NOTE: there is probably a better way to do the sapply.

library(XML)
library(dplyr)
library(rvest)
trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)
# neither rvest::html nor rvest::html_session liked it, hence using XML::htmlParse
doc <- htmlParse("http://nomads.ncep.noaa.gov/")
ds <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
descendant::td[contains(., 'http')]/
preceding-sibling::td[3]")
data_set <- ds %>% html_text() %>% trim()
data_set_descr_link <- ds %>% html_nodes("a") %>% html_attr("href")
freq <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
descendant::td[contains(., 'hourly') or
contains(., 'hours') or
contains(., 'daily') or
contains(., '06Z')]") %>%
html_text() %>% trim()
grib_filter <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                                         descendant::td[contains(., 'http')]/preceding-sibling::td[1]") %>%
  sapply(function(x) {
    ifelse(x %>% xpathApply("boolean(./a)"),
           x %>% html_node("a") %>% html_attr("href"),
           NA)
  })
http_link <- doc %>% html_nodes("a[href^='/pub/data/']") %>% html_attr("href")
gds_alt <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                                     descendant::td[contains(., 'http')]/following-sibling::td[1]") %>%
  sapply(function(x) {
    ifelse(x %>% xpathApply("boolean(./a)"),
           x %>% html_node("a") %>% html_attr("href"),
           NA)
  })
nom <- data.frame(data_set,
                  data_set_descr_link,
                  freq,
                  grib_filter,
                  gds_alt)
glimpse(nom)
## Variables:
## $ data_set (fctr) FNL, GFS 1.0x1.0 Degree, GFS 0.5x0.5 Degr...
## $ data_set_descr_link (fctr) txt_descriptions/fnl_doc.shtml, txt_descr...
## $ freq (fctr) 6 hours, 6 hours, 6 hours, 12 hours, 6 ho...
## $ grib_filter (fctr) cgi-bin/filter_fnl.pl, cgi-bin/filter_gfs...
## $ gds_alt (fctr) dods-alt/fnl, dods-alt/gfs, dods-alt/gfs_...
head(nom)
## data_set
## 1 FNL
## 2 GFS 1.0x1.0 Degree
## 3 GFS 0.5x0.5 Degree
## 4 GFS 2.5x2.5 Degree
## 5 GFS Ensemble high resolution
## 6 GFS Ensemble Precip Bias-Corrected
##
## data_set_descr_link freq
## 1 txt_descriptions/fnl_doc.shtml 6 hours
## 2 txt_descriptions/GFS_high_resolution_doc.shtml 6 hours
## 3 txt_descriptions/GFS_half_degree_doc.shtml 6 hours
## 4 txt_descriptions/GFS_Low_Resolution_doc.shtml 12 hours
## 5 txt_descriptions/GFS_Ensemble_high_resolution_doc.shtml 6 hours
## 6 txt_descriptions/GFS_Ensemble_precip_bias_corrected_doc.shtml daily
##
## grib_filter gds_alt
## 1 cgi-bin/filter_fnl.pl dods-alt/fnl
## 2 cgi-bin/filter_gfs.pl dods-alt/gfs
## 3 cgi-bin/filter_gfs_hd.pl dods-alt/gfs_hd
## 4 cgi-bin/filter_gfs_2p5.pl dods-alt/gfs_2p5
## 5 cgi-bin/filter_gens.pl dods-alt/gens
## 6 cgi-bin/filter_gensbc_precip.pl dods-alt/gens_bc
(Anyone should feel free to edit this and to document it further.)

It is really fragile code: if the format changes, it breaks (though that is true of all scraping). It should survive them actually producing valid HTML (this is horrible HTML, by the way), but most of it relies on the http column staying valid, since most of the other column extractions key off of it. Your missing model is in there, too. If any of the XPath is confusing, drop a comment and I'll try to explain.
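In case the sibling axes are unclear, here is a minimal sketch on a toy row laid out like a NOMADS table row (hypothetical values, not the real page), showing what preceding-sibling::td[3] and following-sibling::td[1] select relative to the cell containing "http":

```r
library(XML)

# One toy row laid out like the NOMADS table:
# Data Set | freq | grib filter | http | gds-alt  (hypothetical values)
doc <- htmlParse('<table><tr>
  <td>FNL</td><td>6 hours</td><td>cgi-bin/filter_fnl.pl</td>
  <td>http</td><td>dods-alt/fnl</td>
</tr></table>')

# preceding-sibling is a reverse axis, so td[3] counts three cells
# *backwards* from the "http" cell: that is the Data Set column.
xpathSApply(doc, "//td[contains(., 'http')]/preceding-sibling::td[3]", xmlValue)
# "FNL"

# following-sibling::td[1] is the next cell after "http": the gds-alt column.
xpathSApply(doc, "//td[contains(., 'http')]/following-sibling::td[1]", xmlValue)
# "dods-alt/fnl"
```

The reverse-order counting on preceding-sibling is the part that usually trips people up; [1] is the nearest preceding cell, not the first in document order.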
Answer 1 (score: 0)
Sometimes you just have to fix the broken HTML, so here you can add the missing tr tags at the start of the offending lines.
library(XML)

url <- "http://nomads.ncep.noaa.gov/"
x <- readLines(url, encoding="UTF-8")
doc <- htmlParse(x)
# check nodes after subheaders - only 2 of 5 rows missing tr (2nd and 3rd element)
getNodeSet(doc, "//td[@colspan='7']/../following-sibling::*[1]")
# fix text - probably some way to fix XML doc too?
n <- grep(">AQM Daily Maximum<", x)
x[n] <- paste0("<tr>", x[n])
n <- grep(">RTOFS Atlantic<", x)
x[n] <- paste0("<tr>", x[n])
doc <- htmlParse(x)
## ok..
getNodeSet(doc, "//td[@colspan='7']/../following-sibling::*[1]")
readHTMLTable(doc, which=9, header=TRUE)
Data Set freq grib filter http gds-alt
1 Global Models <NA> <NA> <NA> <NA>
2 FNL 6 hours grib filter http OpenDAP-alt
3 GFS 1.0x1.0 Degree 6 hours grib filter http OpenDAP-alt
...
16 Climate Forecast System 3D Pressure Products 6 hours grib filter http -
17 Regional Models <NA> <NA> <NA> <NA>
18 AQM Daily Maximum 06Z, 12Z grib filter http OpenDAP-alt
19 AQM Hourly Surface Ozone 06Z, 12Z grib filter http OpenDAP-alt
20 HIRES Alaska daily grib filter http OpenDAP-alt
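If more subheader-adjacent rows break in the future, the hard-coded greps for specific model names can be generalized. This is a hypothetical sketch that assumes every broken row is a source line starting with a td but missing its opening tr; verify that assumption against the live page before relying on it:

```r
library(XML)

url <- "http://nomads.ncep.noaa.gov/"
x <- readLines(url, encoding = "UTF-8")

# Assumption: a line that opens with <td> but contains no <tr> is a table
# row whose <tr> tag the page omitted. This may over- or under-match if
# the page's line layout changes.
bad <- grepl("^\\s*<td", x) & !grepl("<tr", x)
x[bad] <- paste0("<tr>", x[bad])

readHTMLTable(htmlParse(x), which = 9, header = TRUE)
```

Like the rest of the scraping here, this is fragile by nature, but it avoids having to add a new grep every time another model row loses its tr tag.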