Error when trying to scrape data from FiveThirtyEight

Asked: 2018-11-23 02:31:19

Tags: r web-scraping rvest

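I'm trying to scrape data from FiveThirtyEight's presidential approval rating page so that I can put the dates, pollsters, sample sizes and percentages into a data frame in R. My first attempt used html_nodes: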

pres_approval <- read_html("https://projects.fivethirtyeight.com/trump-approval-ratings/")

pres_approval <- pres_approval %>%
                     html_nodes(css = "table") %>%
                     nth(2) %>%
                     html_table(header = TRUE, fill = TRUE)

which returned

Error in nodes_duplicated(nodes) : Expecting an external pointer: [type=NULL].

I then tried again, this time using SelectorGadget to pick the selectors:

pres_approval <- read_html("https://projects.fivethirtyeight.com/trump-approval-ratings/")

pres_approval <- pres_approval %>%
                     html_nodes(css = "td , .heat-map , .pollster a") %>%
                     nth(2) %>%
                     html_table(header = TRUE, fill = TRUE)

which returned

Error in html_table.xml_node(., header = TRUE, fill = TRUE) : html_name(x) == "table" is not TRUE

What can I do from here?

2 Answers:

Answer 0 (score: 1)

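As is common for pages like this, the data is loaded asynchronously via XHR requests. You can check this by opening Developer Tools in your browser and reloading the page. Under Network -> XHR you will see lots of lovely JSON.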


I'm not sure which one you want (I only skimmed the question), but you can easily grab all of the main JSON files:

polls <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/polls.json")

str(polls, 1)
## 'data.frame': 3401 obs. of  14 variables:
##  $ id           : int  77261 77265 77272 77249 77257 77266 77596 77246 77263 77253 ...
##  $ subgroup     : chr  "All polls" "All polls" "All polls" "All polls" ...
##  $ sampleSize   : int  1992 1500 1190 1043 1500 2692 1712 1500 1500 1991 ...
##  $ population   : chr  "rv" "a" "rv" "rv" ...
##  $ weight       : num  0.946 0.245 1.645 1.166 0.639 ...
##  $ grade        : chr  "B-" "B" "A-" "B" ...
##  $ multiversions: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ url          : chr  "http://www.politico.com/story/2017/01/poll-voters-liked-trumps-inaugural-address-234148" "http://www.gallup.com/poll/201617/gallup-daily-trump-job-approval.aspx" "https://poll.qu.edu/national/release-detail?ReleaseID=2415" "http://www.publicpolicypolling.com/pdf/2015/PPP_Release_National_12617.pdf" ...
##  $ created_at   : chr  "2017-01-23" "2017-01-23" "2017-01-26" "2017-01-25" ...
##  $ startDate    : chr  "2017-01-20" "2017-01-20" "2017-01-20" "2017-01-23" ...
##  $ endDate      : chr  "2017-01-22" "2017-01-22" "2017-01-25" "2017-01-24" ...
##  $ pollster     : chr  "Morning Consult" "Gallup" "Quinnipiac University" "Public Policy Polling" ...
##  $ tracking     : chr  "" "T" "" "" ...
##  $ answers      :List of 3401

approval <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/approval.json")

str(approval, 1)
## 'data.frame': 2751 obs. of  9 variables:
##  $ date               : chr  "2017-01-23" "2017-01-23" "2017-01-23" "2017-01-24" ...
##  $ future             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ subgroup           : chr  "Adults" "All polls" "Voters" "Adults" ...
##  $ approve_estimate   : chr  "45" "45.46693" "46" "45" ...
##  $ approve_hi         : chr  "51.1347" "50.88971" "52.29238" "50.98562" ...
##  $ approve_lo         : chr  "38.8653" "40.04416" "39.70762" "39.01438" ...
##  $ disapprove_estimate: chr  "45" "41.26452" "37" "45.74659" ...
##  $ disapprove_hi      : chr  "51.1347" "46.68729" "43.29238" "51.73221" ...
##  $ disapprove_lo      : chr  "38.8653" "35.84175" "30.70762" "39.76097" ...

historic_approval <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/historical-approval.json")

str(historic_approval, 1)
## 'data.frame': 26001 obs. of  6 variables:
##  $ president          : chr  "Harry S. Truman" "Harry S. Truman" "Harry S. Truman" "Harry S. Truman" ...
##  $ date               : chr  "1945-06-06" "1945-06-07" "1945-06-08" "1945-06-09" ...
##  $ days               : int  55 56 57 58 59 60 61 62 63 64 ...
##  $ subgroup           : chr  "All polls" "All polls" "All polls" "All polls" ...
##  $ approve_estimate   : chr  "87" "87" "87" "87" ...
##  $ disapprove_estimate: chr  "3" "3" "3" "3" ...
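The `answers` column in the `polls` output above is reported as a list rather than a plain vector. That is how jsonlite represents records that each contain a nested array. Here is a minimal sketch on inline JSON. The field names inside `answers` (`choice`, `pct`) and all values are made up for illustration and are not taken from the real polls.json:

```r
library(jsonlite)

# Inline JSON standing in for a feed of poll records.
# NOTE: "choice" and "pct" are hypothetical field names; the values are invented.
json_text <- '[
  {"id": 1, "pollster": "Pollster A",
   "answers": [{"choice": "Approve", "pct": 45},
               {"choice": "Disapprove", "pct": 44}]},
  {"id": 2, "pollster": "Pollster B",
   "answers": [{"choice": "Approve", "pct": 52},
               {"choice": "Disapprove", "pct": 42}]}
]'

# An array of objects is simplified to a data frame, one row per object
polls_demo <- fromJSON(json_text)
str(polls_demo, 1)

# The nested array becomes a list column; each element is its own data frame
first_answers <- polls_demo$answers[[1]]
print(first_answers)
```

If the real `answers` column has a similar shape, each element would need to be inspected with `str()` before deciding how to flatten it.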

I'd run the resulting data frames through readr::type_convert() to get better column types.
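To illustrate what that conversion does, here is a small sketch on a hand-built data frame that mimics a few columns of `approval` (all stored as character). The name `approval_sample` is just a stand-in so the example runs without a network connection:

```r
library(readr)

# A tiny stand-in for the approval data: every column starts out as character
approval_sample <- data.frame(
  date             = c("2017-01-23", "2017-01-24"),
  subgroup         = c("Adults", "Adults"),
  approve_estimate = c("45", "45"),
  approve_hi       = c("51.1347", "50.98562"),
  stringsAsFactors = FALSE
)

# type_convert() re-guesses the type of each character column
approval_typed <- type_convert(approval_sample)
str(approval_typed)
```

The guessed types depend on the values present, so it is worth checking the `str()` output on the real data rather than assuming the result.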

Answer 1 (score: 0)

@hrbrmstr's answer is the cleaner way to get the table you need. Once the JSON files are fetched, the resulting tables can be used for any purpose.

When I reproduced your example using nth(1) instead of nth(2), I got the table.

Here is an example using the Biden approval rating page:

pres_approval <- read_html("https://projects.fivethirtyeight.com/biden-approval-rating")

pres_approval <- pres_approval %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "polls", " " ))]') %>%
  nth(1) %>%
  html_table(header = TRUE, fill = TRUE)

print(pres_approval)

# A tibble: 15 x 15
   ``    DATES   POLLSTER    GRADE SAMPLE SAMPLE WEIGHT APPROVE APPROVE DISAPPROVE DISAPPROVE
   <chr> <chr>   <chr>       <chr> <chr>  <chr>   <dbl> <chr>   <chr>   <lgl>      <chr>     
 1 •     Jun. 1~ Ipsos       "B-"  1,002  A        0.86 52%     42%     NA         50%       
 2 •     Jun. 1~ YouGov      "B+"  1,500  A        1.42 48%     43%     NA         49%       
 3 •     Jun. 9~ Morning Co~ "B"   15,000 A        1.95 53%     39%     NA         52%       
 4 •     Jun. 1~ AP-NORC     ""    1,125  A        1.51 55%     44%     NA         51%       
 5 •     Jun. 9~ Monmouth U~ "A"   810    A        1.53 48%     43%     NA         48%       
 6 •     Jun. 1~ Ipsos       "B-"  1,002  A        0.86 52%     42%     NA         51%       
 7 •     Jun. 1~ Rasmussen ~ "B"   1,500  LV       1.3  51%     48%     NA         54%       
 8 •     Jun. 1~ YouGov      "B+"  1,500  A        1.42 48%     43%     NA         49%       
 9 •     Jun. 9~ Morning Co~ "B"   15,000 A        1.8  53%     39%     NA         52%       
10 •     Jun. 1~ AP-NORC     ""    1,125  A        1.51 55%     44%     NA         51%       
11 •     Jun. 1~ Rasmussen ~ "B"   1,500  LV       1.3  51%     48%     NA         54%       
12 •     Jun. 1~ YouGov      "B+"  1,305  RV       1.38 48%     45%     NA         49%       
13 •     Jun. 1~ Rasmussen ~ "B"   1,500  LV       0.77 49%     49%     NA         52%       
14 •     Jun. 1~ Global Str~ "B/C" 1,001  RV       1    52%     44%     NA         52%       
15 •     Jun. 9~ Monmouth U~ "A"   758    RV       1.45 49%     43%     NA         50%

There is only one table, so you should set nth to 1.
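One way to avoid guessing the index is to count the matched nodes before picking one. The sketch below runs on an inline HTML string (a made-up one-table page, not the FiveThirtyEight markup) so it works offline:

```r
library(rvest)

# Hypothetical stand-in page containing a single table
page <- read_html(
  "<html><body>
     <table>
       <tr><th>POLLSTER</th><th>APPROVE</th></tr>
       <tr><td>Pollster A</td><td>52%</td></tr>
     </table>
   </body></html>"
)

tables <- html_nodes(page, css = "table")

# Check how many tables matched before indexing into the node set
n_tables <- length(tables)
print(n_tables)

# Only parse a table if at least one was found
if (n_tables >= 1) {
  first_table <- html_table(tables[[1]], header = TRUE)
  print(first_table)
}
```

If `n_tables` is smaller than the index you pass to nth(), there is nothing at that position for html_table() to parse.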

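If you look closely, you will see that the table needs some simple wrangling: renaming columns and dropping the empty ones. Also, the table contains only 15 rows.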
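A rough base R sketch of that kind of cleanup, run on a hand-made data frame shaped loosely like the scraped output. The column names (`marker`, `sample_size`, `approve_pct`, and so on) are my own choices for illustration, and which scraped column means what should be checked against the page:

```r
# Hypothetical stand-in for the scraped table: a bullet column, text numbers,
# and one column that is entirely NA
polls_raw <- data.frame(
  marker      = c("\u2022", "\u2022"),
  pollster    = c("Pollster A", "Pollster B"),
  sample      = c("1,002", "1,500"),
  approve     = c("52%", "48%"),
  empty       = c(NA, NA),
  disapprove  = c("50%", "49%"),
  stringsAsFactors = FALSE
)

# Drop columns in which every value is NA
has_data <- vapply(polls_raw, function(col) !all(is.na(col)), logical(1))
polls_clean <- polls_raw[, has_data, drop = FALSE]

# Drop the bullet-marker column, which carries no information
polls_clean$marker <- NULL

# Give the remaining columns descriptive names
names(polls_clean) <- c("pollster", "sample_size", "approve_pct", "disapprove_pct")

# Strip thousands separators and percent signs, then convert to numbers
polls_clean$sample_size    <- as.integer(gsub(",", "", polls_clean$sample_size))
polls_clean$approve_pct    <- as.numeric(sub("%", "", polls_clean$approve_pct))
polls_clean$disapprove_pct <- as.numeric(sub("%", "", polls_clean$disapprove_pct))

str(polls_clean)
```

Because the real table has duplicated header names, renaming by position as above is fragile; confirm the column order on the actual output first.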

To get the full table, you can use RSelenium to expand the table in a browser session and then capture it in full.