Dropped rows using readHTMLTable in R

Asked: 2014-12-21 18:46:50

Tags: xml r web-scraping

I am trying to use readHTMLTable to extract model data from NOAA. The table I am after has several subheaders, where each subheader consists of a single cell spanning all of the columns, as far as I can tell from the HTML. For some reason this causes readHTMLTable to omit the row immediately following each subheader. The following code reproduces the issue:

library(XML)

url <- "http://nomads.ncep.noaa.gov/"
ncep.tables = readHTMLTable(url, header=TRUE)

#Find the list of real time models
for(ncep.table in ncep.tables) {
    if("grib filter" %in% names(ncep.table) & "gds-alt" %in% names(ncep.table)) {
        rt.tbl <- ncep.table
     }
}

#Here's where the problem is:
cat(paste(rt.tbl[["Data Set"]][15:20], collapse = "\n"))

#On the website, there is a model called "AQM Daily Maximum"
#between Regional Models and AQM Hourly Surface Ozone
#but it's missing now...

So if you go to http://nomads.ncep.noaa.gov/ and look at the central table (the one headed "Data Set"), you will see a subheader called "Regional Models." The AQM Daily Maximum model, which sits directly below that subheader, is skipped during the extraction in the code above.
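In case it helps, this is how I have been eyeballing the raw markup around the skipped row (a small sketch; the grep pattern just targets the model name):

x <- readLines("http://nomads.ncep.noaa.gov/", encoding="UTF-8")
n <- grep("AQM Daily Maximum", x)[1]
# print the source line for the skipped model plus one line of context on either side
writeLines(x[(n - 1):(n + 1)])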

I maintain the rNOMADS package in R, so if I can get this working it will save me time maintaining the package and help keep it accurate and up to date for its users. Thanks for your help!

2 Answers:

Answer 0 (score: 0)

By golly, I think I've got it. You can't use readHTMLTable for this (and I now know far more about the XML package's internals than I did before... there is some serious R-fu in that code). I'm using both rvest and XML only because I ended up mixing XPath and CSS selectors (I find I think more naturally in XPath); dplyr is there only for glimpse.

Please make sure the columns match up; I eyeballed it, but verification would be great. NOTE: there may be a better way to do the sapply calls (anyone should feel free to edit this, and to document it further).

library(XML)
library(dplyr)
library(rvest)

trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)

# neither rvest::html nor rvest::html_session liked it, hence using XML::htmlParse
doc <- htmlParse("http://nomads.ncep.noaa.gov/")

ds <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                                descendant::td[contains(., 'http')]/
                                preceding-sibling::td[3]")

data_set <- ds %>% html_text() %>% trim()
data_set_descr_link <- ds %>% html_nodes("a") %>% html_attr("href")

freq <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                                  descendant::td[contains(., 'hourly') or
                                                 contains(., 'hours') or
                                                 contains(., 'daily') or
                                                 contains(., '06Z')]") %>%
        html_text() %>% trim()

grib_filter <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                                         descendant::td[contains(., 'http')]/
                                         preceding-sibling::td[1]") %>%
               sapply(function(x) {
                 ifelse(x %>% xpathApply("boolean(./a)"),
                        x %>% html_node("a") %>% html_attr("href"),
                        NA)
               })

http_link <- doc %>% html_nodes("a[href^='/pub/data/']") %>% html_attr("href")

gds_alt <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                                     descendant::td[contains(., 'http')]/
                                     following-sibling::td[1]") %>%
           sapply(function(x) {
             ifelse(x %>% xpathApply("boolean(./a)"),
                    x %>% html_node("a") %>% html_attr("href"),
                    NA)
           })

nom <- data.frame(data_set, data_set_descr_link, freq, grib_filter, gds_alt)

glimpse(nom)

## Variables:
## $ data_set            (fctr) FNL, GFS 1.0x1.0 Degree, GFS 0.5x0.5 Degr...
## $ data_set_descr_link (fctr) txt_descriptions/fnl_doc.shtml, txt_descr...
## $ freq                (fctr) 6 hours, 6 hours, 6 hours, 12 hours, 6 ho...
## $ grib_filter         (fctr) cgi-bin/filter_fnl.pl, cgi-bin/filter_gfs...
## $ gds_alt             (fctr) dods-alt/fnl, dods-alt/gfs, dods-alt/gfs_...

head(nom)

##                             data_set
## 1                                FNL
## 2                 GFS 1.0x1.0 Degree
## 3                 GFS 0.5x0.5 Degree
## 4                 GFS 2.5x2.5 Degree
## 5       GFS Ensemble high resolution
## 6 GFS Ensemble Precip Bias-Corrected
##
##                                              data_set_descr_link     freq
## 1                                 txt_descriptions/fnl_doc.shtml  6 hours
## 2                 txt_descriptions/GFS_high_resolution_doc.shtml  6 hours
## 3                     txt_descriptions/GFS_half_degree_doc.shtml  6 hours
## 4                  txt_descriptions/GFS_Low_Resolution_doc.shtml 12 hours
## 5        txt_descriptions/GFS_Ensemble_high_resolution_doc.shtml  6 hours
## 6  txt_descriptions/GFS_Ensemble_precip_bias_corrected_doc.shtml    daily
##
##                       grib_filter          gds_alt
## 1           cgi-bin/filter_fnl.pl     dods-alt/fnl
## 2           cgi-bin/filter_gfs.pl     dods-alt/gfs
## 3        cgi-bin/filter_gfs_hd.pl  dods-alt/gfs_hd
## 4       cgi-bin/filter_gfs_2p5.pl dods-alt/gfs_2p5
## 5          cgi-bin/filter_gens.pl    dods-alt/gens
## 6 cgi-bin/filter_gensbc_precip.pl dods-alt/gens_bc
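Since every column extraction above hangs off the same anchor XPath, here is an annotated breakdown of it (the annotations are mine):

# //table/descendant::th[@class='nomads'][1]   the first th with class 'nomads'
#                                              inside a table (this picks out the
#                                              NOMADS data-set table)
# /../..                                       climb from that th past its row,
#                                              back up toward the table element
# /descendant::td[contains(., 'http')]         every cell whose text contains
#                                              'http' -- one per real data-set row,
#                                              so the full-width subheader rows
#                                              never match
# /preceding-sibling::td[3]                    the cell three siblings to the left,
#                                              i.e. the "Data Set" column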

It's really fragile code, i.e. if the format changes, it breaks (but that's true of all scraping). It should withstand them actually producing valid HTML someday (this is pretty gnarly HTML, btw), but most of the code relies on the 'http' column staying valid, since most of the other column extractions key off of it. Your missing model is in there, too. If any of the XPath is confusing, drop a comment and I'll try to 'splain.
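A quick sanity check on the result (a sketch; it just pokes at the nom data frame built above):

# the model the question reported as missing should now be present
"AQM Daily Maximum" %in% nom$data_set
subset(nom, grepl("^AQM", data_set))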

Answer 1 (score: 0)

Sometimes you just need to fix the bad HTML, so here you can add the missing tr tags at the start of those lines.

library(XML)  # htmlParse, getNodeSet, and readHTMLTable all come from XML

url <- "http://nomads.ncep.noaa.gov/"
x <- readLines(url, encoding="UTF-8")
doc <- htmlParse(x)

# check nodes after subheaders - only 2 of 5 rows missing tr (2nd and 3rd element)
getNodeSet(doc, "//td[@colspan='7']/../following-sibling::*[1]")
# fix text - probably some way to fix XML doc too?
n <- grep(">AQM Daily Maximum<", x)
x[n] <- paste0("<tr>", x[n])
n <- grep(">RTOFS Atlantic<", x)
x[n] <- paste0("<tr>", x[n])

doc <- htmlParse(x)
## ok..
getNodeSet(doc, "//td[@colspan='7']/../following-sibling::*[1]")
readHTMLTable(doc, which=9, header=TRUE)

                                      Data Set     freq grib filter http     gds-alt
1                                 Global Models     <NA>        <NA> <NA>        <NA>
2                                           FNL  6 hours grib filter http OpenDAP-alt
3                            GFS 1.0x1.0 Degree  6 hours grib filter http OpenDAP-alt
...
16 Climate Forecast System 3D Pressure Products  6 hours grib filter http           -
17                              Regional Models     <NA>        <NA> <NA>        <NA>
18                            AQM Daily Maximum 06Z, 12Z grib filter http OpenDAP-alt
19                     AQM Hourly Surface Ozone 06Z, 12Z grib filter http OpenDAP-alt
20                                 HIRES Alaska    daily grib filter http OpenDAP-alt
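
And a quick check (a sketch along the same lines) that both patched rows actually made it into the parsed table:

tbl <- readHTMLTable(doc, which=9, header=TRUE)
stopifnot("AQM Daily Maximum" %in% tbl[["Data Set"]],
          "RTOFS Atlantic" %in% tbl[["Data Set"]])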