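I want to extract the bottom table ("Daily Observations") from https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1. I copied the full XPath of the table component, but the query returns {xml_nodeset (0)}
as output. What am I doing wrong here? I used the following code: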
library(rvest)
single <- read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')
single %>%
html_nodes(xpath = '/html/body/app-root/app-history/one-column-layout/wu-header/sidenav/mat-sidenav-container/mat-sidenav-content/div/section/div[2]/div/div[5]/div/div/lib-city-history-observation/div/div[2]/table')
It looks like the table component is empty.
Answer 0 (score: 2)
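This is a dynamic page: the table is generated by JavaScript after the page loads, so it is not present in the static HTML that read_html() downloads.
rvest alone
is not enough. Nevertheless, you can still get the source data from the underlying JSON API.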
library(tidyverse)
library(rvest)
library(lubridate)
library(jsonlite)
# Read static html. It won't create the table, but it holds the API key
# we need to retrieve the source JSON.
htm_obj <-
read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')
# Retrieve the API key. This key is stored in a node with javascript content.
str_apikey <-
html_node(htm_obj, xpath = '//script[@id="app-root-state"]') %>%
html_text() %>% gsub("^.*SUN_API_KEY&q;:&q;|&q;.*$", "", . )
# Create a URI pointing to the API, with the API key as the first key-value pair of the query
url_apijson <- paste0(
"https://api.weather.com/v1/location/KDCA:9:US/observations/historical.json?apiKey=",
str_apikey,
"&units=e&startDate=20110101&endDate=20110101")
# Capture the JSON
json_obj <- fromJSON(txt = url_apijson)
# Wrangle the JSON's contents into the table you need
tbl_daily <-
json_obj$observations %>% as_tibble() %>%
mutate(valid_time_gmt = as_datetime(valid_time_gmt) %>%
with_tz("America/New_York")) %>% # The time zone in which this airport (KDCA) is located.
select(valid_time_gmt, temp, dewPt, rh, wdir_cardinal, gust, pressure, precip_hrly) # The equivalent variables of your HTML table
# A tibble: 34 x 8
valid_time_gmt temp dewPt rh wdir_cardinal gust pressure precip_hrly
<dttm> <int> <int> <int> <chr> <lgl> <dbl> <dbl>
1 2010-12-31 23:52:00 38 NA 79 CALM NA 30.1 NA
2 2011-01-01 00:52:00 35 31 85 CALM NA 30.1 NA
3 2011-01-01 01:52:00 36 31 82 CALM NA 30.1 NA
4 2011-01-01 02:52:00 37 31 79 CALM NA 30.1 NA
5 2011-01-01 03:52:00 36 30 79 CALM NA 30.1 NA
6 2011-01-01 04:52:00 37 30 76 NNE NA 30.1 NA
7 2011-01-01 05:52:00 36 30 79 CALM NA 30.1 NA
8 2011-01-01 06:52:00 34 30 85 CALM NA 30.1 NA
9 2011-01-01 07:52:00 37 31 79 CALM NA 30.1 NA
10 2011-01-01 08:52:00 44 38 79 CALM NA 30.1 NA
# ... with 24 more rows
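If the key-extraction step looks opaque, here is a minimal offline sketch of what the gsub() call does. The sample string and the key value below are made up for illustration; the real content of the app-root-state node is much longer and its layout may change, so treat this as an assumption about its shape rather than a description of the site.

```r
# Hypothetical fragment of the text inside <script id="app-root-state">.
# Quotes are encoded as "&q;" in that node; the key value here is invented.
state_txt <- "{&q;other&q;:1,&q;SUN_API_KEY&q;:&q;abc123def456&q;,&q;more&q;:2}"

# The pattern has two alternatives:
#   1. everything from the start up to and including 'SUN_API_KEY&q;:&q;'
#   2. everything from the next '&q;' to the end of the string
# Removing both leaves only the key itself.
api_key <- gsub("^.*SUN_API_KEY&q;:&q;|&q;.*$", "", state_txt)

print(api_key)
```

If SUN_API_KEY is not found in the text, the first alternative never matches and the result is just whatever precedes the first "&q;", so it may be worth checking that the extracted value looks like a key before building the API URL.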