I am trying to scrape data from FiveThirtyEight's presidential approval rating page so that the dates, pollsters, sample sizes, and percentages end up in a data frame in R. My first attempt used html_nodes:
library(rvest)  # read_html(), html_nodes(), html_table()
library(dplyr)  # %>% and nth()

pres_approval <- read_html("https://projects.fivethirtyeight.com/trump-approval-ratings/")
pres_approval <- pres_approval %>%
  html_nodes(css = "table") %>%
  nth(2) %>%
  html_table(header = TRUE, fill = TRUE)
which returned

Error in nodes_duplicated(nodes) : Expecting an external pointer: [type=NULL].

I then tried again with the selector found using SelectorGadget:
pres_approval <- read_html("https://projects.fivethirtyeight.com/trump-approval-ratings/")
pres_approval <- pres_approval %>%
  html_nodes(css = "td , .heat-map , .pollster a") %>%
  nth(2) %>%
  html_table(header = TRUE, fill = TRUE)
which returned

Error in html_table.xml_node(., header = TRUE, fill = TRUE) : html_name(x) == "table" is not TRUE

What can I do from here?
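Answer 0: (score: 1)
Sites like this usually load their data asynchronously via XHR requests. You can check by opening Developer Tools in your browser and reloading the page. Under Network -> XHR you will see a lot of lovely JSON.
I don't know which file you want (I only skimmed the question), but you can easily grab all of the main JSON files: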
polls <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/polls.json")
str(polls, 1)
## 'data.frame': 3401 obs. of 14 variables:
## $ id : int 77261 77265 77272 77249 77257 77266 77596 77246 77263 77253 ...
## $ subgroup : chr "All polls" "All polls" "All polls" "All polls" ...
## $ sampleSize : int 1992 1500 1190 1043 1500 2692 1712 1500 1500 1991 ...
## $ population : chr "rv" "a" "rv" "rv" ...
## $ weight : num 0.946 0.245 1.645 1.166 0.639 ...
## $ grade : chr "B-" "B" "A-" "B" ...
## $ multiversions: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ url : chr "http://www.politico.com/story/2017/01/poll-voters-liked-trumps-inaugural-address-234148" "http://www.gallup.com/poll/201617/gallup-daily-trump-job-approval.aspx" "https://poll.qu.edu/national/release-detail?ReleaseID=2415" "http://www.publicpolicypolling.com/pdf/2015/PPP_Release_National_12617.pdf" ...
## $ created_at : chr "2017-01-23" "2017-01-23" "2017-01-26" "2017-01-25" ...
## $ startDate : chr "2017-01-20" "2017-01-20" "2017-01-20" "2017-01-23" ...
## $ endDate : chr "2017-01-22" "2017-01-22" "2017-01-25" "2017-01-24" ...
## $ pollster : chr "Morning Consult" "Gallup" "Quinnipiac University" "Public Policy Polling" ...
## $ tracking : chr "" "T" "" "" ...
## $ answers :List of 3401
approval <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/approval.json")
str(approval, 1)
## 'data.frame': 2751 obs. of 9 variables:
## $ date : chr "2017-01-23" "2017-01-23" "2017-01-23" "2017-01-24" ...
## $ future : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ subgroup : chr "Adults" "All polls" "Voters" "Adults" ...
## $ approve_estimate : chr "45" "45.46693" "46" "45" ...
## $ approve_hi : chr "51.1347" "50.88971" "52.29238" "50.98562" ...
## $ approve_lo : chr "38.8653" "40.04416" "39.70762" "39.01438" ...
## $ disapprove_estimate: chr "45" "41.26452" "37" "45.74659" ...
## $ disapprove_hi : chr "51.1347" "46.68729" "43.29238" "51.73221" ...
## $ disapprove_lo : chr "38.8653" "35.84175" "30.70762" "39.76097" ...
historic_approval <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/historical-approval.json")
str(historic_approval, 1)
## 'data.frame': 26001 obs. of 6 variables:
## $ president : chr "Harry S. Truman" "Harry S. Truman" "Harry S. Truman" "Harry S. Truman" ...
## $ date : chr "1945-06-06" "1945-06-07" "1945-06-08" "1945-06-09" ...
## $ days : int 55 56 57 58 59 60 61 62 63 64 ...
## $ subgroup : chr "All polls" "All polls" "All polls" "All polls" ...
## $ approve_estimate : chr "87" "87" "87" "87" ...
## $ disapprove_estimate: chr "3" "3" "3" "3" ...
I would run the resulting data frames through readr::type_convert() to get better column types.
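As a rough illustration of that last step, here is a minimal sketch that parses a small inline JSON string and passes the result through readr::type_convert(). The JSON is a hand-made stand-in that reuses a few values from the approval.json output above, so no network access is needed; the names json_txt, approval_raw, approval_typed, and all_polls are made up for this example.

```r
library(jsonlite)
library(readr)

# Hand-made stand-in for approval.json; the real file has more fields and rows.
json_txt <- '[
  {"date": "2017-01-23", "future": false, "subgroup": "Adults",
   "approve_estimate": "45", "disapprove_estimate": "45"},
  {"date": "2017-01-23", "future": false, "subgroup": "All polls",
   "approve_estimate": "45.46693", "disapprove_estimate": "41.26452"}
]'

# An array of JSON objects is simplified to a data frame;
# numbers stored as strings stay character at this point.
approval_raw <- fromJSON(json_txt)

# Re-guess the type of every character column
# (ISO dates become Date, numeric strings become double).
approval_typed <- type_convert(approval_raw)

# Keep only the combined "All polls" series.
all_polls <- approval_typed[approval_typed$subgroup == "All polls", ]
str(approval_typed)
```

The same two calls should work on the data frames returned from the real URLs, although list columns such as answers in polls.json are left untouched by type_convert().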
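Answer 1: (score: 0)
@hrbrmstr's answer is the cleaner way to get the table you need. Once the JSON files have been fetched, the resulting tables can be used for any purpose.
When I reproduced your example with nth(1) instead of nth(2), I got the table.
Here is an example for Biden's approval rating: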
pres_approval <- read_html("https://projects.fivethirtyeight.com/biden-approval-rating")
pres_approval <- pres_approval %>%
html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "polls", " " ))]') %>%
nth(1) %>%
html_table(header = TRUE, fill = TRUE)
print(pres_approval)
# A tibble: 15 x 15
`` DATES POLLSTER GRADE SAMPLE SAMPLE WEIGHT APPROVE APPROVE DISAPPROVE DISAPPROVE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <chr>
1 • Jun. 1~ Ipsos "B-" 1,002 A 0.86 52% 42% NA 50%
2 • Jun. 1~ YouGov "B+" 1,500 A 1.42 48% 43% NA 49%
3 • Jun. 9~ Morning Co~ "B" 15,000 A 1.95 53% 39% NA 52%
4 • Jun. 1~ AP-NORC "" 1,125 A 1.51 55% 44% NA 51%
5 • Jun. 9~ Monmouth U~ "A" 810 A 1.53 48% 43% NA 48%
6 • Jun. 1~ Ipsos "B-" 1,002 A 0.86 52% 42% NA 51%
7 • Jun. 1~ Rasmussen ~ "B" 1,500 LV 1.3 51% 48% NA 54%
8 • Jun. 1~ YouGov "B+" 1,500 A 1.42 48% 43% NA 49%
9 • Jun. 9~ Morning Co~ "B" 15,000 A 1.8 53% 39% NA 52%
10 • Jun. 1~ AP-NORC "" 1,125 A 1.51 55% 44% NA 51%
11 • Jun. 1~ Rasmussen ~ "B" 1,500 LV 1.3 51% 48% NA 54%
12 • Jun. 1~ YouGov "B+" 1,305 RV 1.38 48% 45% NA 49%
13 • Jun. 1~ Rasmussen ~ "B" 1,500 LV 0.77 49% 49% NA 52%
14 • Jun. 1~ Global Str~ "B/C" 1,001 RV 1 52% 44% NA 52%
15 • Jun. 9~ Monmouth U~ "A" 758 RV 1.45 49% 43% NA 50%
There is only one table, so you should set nth to 1.
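Both errors in the question can be avoided by checking what the selector matched before indexing into it. The following is a minimal sketch on an inline HTML snippet (a made-up page, not the real FiveThirtyEight markup) that counts the matched tables and shows that td nodes are not table nodes, which is what html_table() complained about.

```r
library(rvest)

# Made-up page with a single table, standing in for the real site.
page <- read_html('
  <html><body>
    <table class="polls">
      <tr><th>POLLSTER</th><th>APPROVE</th></tr>
      <tr><td>Pollster A</td><td>52%</td></tr>
    </table>
  </body></html>')

# Count the matches first: asking for a second table that does not exist
# is what led to the "external pointer" error.
tables <- html_nodes(page, "table")
n_tables <- length(tables)

# html_table() only accepts <table> nodes; a "td" selector returns cells instead.
cell_tags <- html_name(html_nodes(page, "td"))

# Index the table that actually exists.
polls_tbl <- html_table(tables[[1]], header = TRUE)
print(polls_tbl)
```

Using tables[[1]] instead of nth() is only a choice made for this sketch; it keeps the example free of a dplyr dependency.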
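If you look closely, you will see that the table needs some simple wrangling: the columns have to be renamed and the empty ones dropped. Also note that the table contains only 15 rows.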
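That wrangling could look roughly like the sketch below. It runs on a toy data frame whose values are illustrative rather than real polls, and the new column names (marker, dates, pollster, sample_size, approve, empty, disapprove) are assumptions chosen for this example, not names used by the site.

```r
library(readr)

# Toy stand-in for the scraped table; the values are illustrative, not real polls.
raw <- data.frame(
  c1 = c("\u2022", "\u2022"),
  c2 = c("Jun. 1-2", "Jun. 9-10"),
  c3 = c("Pollster A", "Pollster B"),
  c4 = c("1,002", "15,000"),
  c5 = c("52%", "53%"),
  c6 = c(NA, NA),
  c7 = c("42%", "39%"),
  stringsAsFactors = FALSE
)
# Mimic the blank and duplicated headers seen in the printed tibble.
names(raw) <- c("", "DATES", "POLLSTER", "SAMPLE", "APPROVE", "APPROVE", "DISAPPROVE")

# 1. Give every column a unique, descriptive name.
names(raw) <- c("marker", "dates", "pollster", "sample_size",
                "approve", "empty", "disapprove")

# 2. Drop the bullet column and any column that is entirely NA or blank.
is_empty <- vapply(raw, function(col) all(is.na(col) | col == ""), logical(1))
clean <- raw[, !is_empty & names(raw) != "marker", drop = FALSE]

# 3. Strip "," and "%" so the figures become numeric.
clean$sample_size <- parse_number(clean$sample_size)
clean$approve     <- parse_number(clean$approve)
clean$disapprove  <- parse_number(clean$disapprove)
str(clean)
```

Renaming before subsetting matters here because subsetting a data frame with duplicated column names can silently alter those names.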
To get the full table, you can use RSelenium to expand the table in a real browser session and then capture it in its entirety.