Scraping data with rvest returns "No matches"

Time: 2016-11-16 17:27:54

Tags: r web-scraping rvest

I'm trying to use rvest to scrape some election results from Politico's website.

http://www.politico.com/2016-election/results/map/president/wisconsin/

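I couldn't grab all the data on the page in one go, so I took a county-level approach. Each county has its own unique CSS selector (for example, Adams County's is '#countyAdams .results-table'). So I grabbed all the county names from elsewhere and set up a quick loop (yes, I know loops are considered bad practice in R, but I expected this approach to take me about 3 minutes).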
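
As a small illustration of the selector pattern described above, the sketch below builds one selector per county with paste0() and flags names that contain a space. The three-county vector is only a sample, and how the page actually encodes the ids of multi-word counties is not known here, so the space check is a diagnostic aid rather than a confirmed cause.

```r
# Sample of county names; the real vector would hold every county.
counties <- c("Adams", "Columbia", "Eau Claire")

# One selector per county, following the '#county<Name> .results-table' pattern.
selectors <- paste0("#county", counties, " .results-table")
print(selectors)

# In CSS a space is the descendant combinator, so "#countyEau Claire .results-table"
# would be read as: element with id "countyEau", then a <Claire> element inside it.
# Names with spaces therefore need whatever id form the page really uses
# (unknown here, so they are only flagged, not rewritten).
has_space <- grepl(" ", counties, fixed = TRUE)
print(counties[has_space])
```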

Grab the URL

library(rvest)

wiscoSixteen <- read_html("http://www.politico.com/2016-election/results/map/president/wisconsin")

Create an empty data.frame (I'm not pre-defining the columns)

stateDf <- NULL

Get the list of counties (it isn't complete, but we don't need all 70 counties to reproduce the problem)

wiscoCounties <- c("Adams", "Ashland", "Barron", "Bayfield", "Brown", "Buffalo", "Burnett", "Calumet", "Chippewa", "Clark", "Columbia", "Crawford", "Dane", "Dodge", "Door", "Douglas", "Dunn", "Eau Claire", "Florence", "Fond du Lac", "Forest", "Grant", "Green", "Green Lake", "Iowa", "Iron", "Jackson", "Jefferson", "Juneau")

My 'for' loop:

for (i in seq_along(wiscoCounties)) {

    # Build the i'th county's selector and pull out its results table
    wiscoResult <- wiscoSixteen %>%
        html_node(paste0("#county", wiscoCounties[i], " .results-table")) %>%
        html_table()

    # Add a column for the county name so I can ID it later
    wiscoResult[,4] <- wiscoCounties[i]

    # Then rbind
    stateDf <- rbind(stateDf, wiscoResult)
}

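Once it gets past the 10th county, it stops and returns 'Error: No matches'.

I can't find anything unusual about the 11th county, 'Columbia'. I'm at a loss as to what's happening. I'm sure it's something dumb, because that's usually how it goes. Any help is appreciated.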
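
One way to narrow a failure like this down is to let the loop continue past counties whose selector matches nothing and then report which ones were skipped. The sketch below is only an illustration: the inline HTML is a made-up stand-in for the real page (so it runs offline), and it uses html_nodes() (plural), which returns an empty set instead of raising an error when nothing matches.

```r
library(rvest)

# Hypothetical stand-in for the real page: only Adams has a results table.
page <- read_html('
<div id="countyAdams"><table class="results-table">
  <tr><th>Candidate</th><th>Votes</th></tr>
  <tr><td>A</td><td>10</td></tr>
</table></div>')

counties <- c("Adams", "Columbia")

tables <- lapply(counties, function(county) {
  nodes <- html_nodes(page, paste0("#county", county, " .results-table"))
  if (length(nodes) == 0) return(NULL)   # no match: skip instead of stopping
  tbl <- html_table(nodes[[1]])
  tbl$county <- county                   # label rows so they can be identified later
  tbl
})

# Counties whose selector matched nothing, worth inspecting in the page source.
not_found <- counties[vapply(tables, is.null, logical(1))]
print(not_found)

# rbind() ignores the NULL entries.
stateDf <- do.call(rbind, tables)
```

If a county shows up in not_found, the table for it is probably not in the HTML that read_html() received, which would point at the page being filled in by a separate request.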

1 Answer:

Answer 0: (score: 3)

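Well, why not just use the XHR request that ultimately populates those tables? (I'm a bit surprised you got any data out of the page at all, since the tables are generated from a separate data request.)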

library(httr)
library(stringi)
library(purrr)
library(dplyr)

res <- GET("http://s3.amazonaws.com/origin-east-elections.politico.com/mapdata/2016/WI_20161108.xml")
dat <- readLines(textConnection(content(res, as="text")))

stri_split_fixed(dat[2], "|")[[1]] %>%
  stri_replace_last_fixed(";", "") %>% 
  stri_split_fixed(";", 3) %>% 
  map_df(~setNames(as.list(.), c("rep_id", "first", "last"))) -> candidates

dat[stri_detect_regex(dat, "^WI;P;G")] %>% 
  stri_replace_first_regex("^WI;P;G;", "") %>% 
  map_df(function(x) {

    county_results <- stri_split_fixed(x, "||", 2)[[1]]

    stri_replace_last_fixed(county_results[1], ";;", "") %>% 
      stri_split_fixed(";") %>% 
      map_df(~setNames(as.list(.), c("fips", "name", "x1", "reporting", "x2", "x3", "x4"))) -> county_prefix

    stri_split_fixed(county_results[2], "|")[[1]] %>% 
      stri_split_fixed(";") %>% 
      map_df(~setNames(as.list(.), c("rep_id", "party", "count", "pct", "x5", "x6", "x7", "x8", "candidate_idx"))) %>% 
      left_join(candidates, by="rep_id") -> df

    df$fips <- county_prefix$fips
    df$name <- county_prefix$name
    df$reporting <- county_prefix$reporting

    select(df, -starts_with("x"))

  }) -> results

That appears to be the complete data:

glimpse(results)
## Observations: 511
## Variables: 10
## $ rep_id        <chr> "WI270631108", "WI270621108", "WI270691108", "WI270711108", "WI270701108", "WI270731108", "WI270721108",...
## $ party         <chr> "Dem", "GOP", "Lib", "CST", "ADP", "WW", "Grn", "Dem", "GOP", "Lib", "CST", "ADP", "WW", "Grn", "Dem", "...
## $ count         <chr> "1382210", "1409467", "106442", "12179", "1561", "1781", "30980", "3780", "5983", "207", "44", "4", "9",...
## $ pct           <chr> "46.9", "47.9", "3.6", "0.4", "0.1", "0.1", "1.1", "37.4", "59.2", "2.0", "0.4", "0.0", "0.1", "0.8", "5...
## $ candidate_idx <chr> "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7",...
## $ first         <chr> "Clinton", "Trump", "Johnson", "Castle", "De La Fuente", "Moorehead", "Stein", "Clinton", "Trump", "John...
## $ last          <chr> "Hillary", "Donald", "Gary", "Darrell", "Rocky", "Monica", "Jill", "Hillary", "Donald", "Gary", "Darrell...
## $ fips          <chr> "0", "0", "0", "0", "0", "0", "0", "55001", "55001", "55001", "55001", "55001", "55001", "55001", "55003...
## $ name          <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Adams", "Ada...
## $ reporting     <chr> "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100....
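
Every column in that output is character, so the vote counts and percentages would need converting before any arithmetic. A minimal sketch, using three of the Adams County rows shown above as a stand-in for the full results data frame:

```r
# Stand-in for `results`: three Adams County rows copied from the output above.
results <- data.frame(
  party = c("Dem", "GOP", "Lib"),
  count = c("3780", "5983", "207"),
  pct   = c("37.4", "59.2", "2.0"),
  name  = "Adams",
  stringsAsFactors = FALSE
)

# Convert the character columns that hold numbers.
results$count <- as.integer(results$count)
results$pct   <- as.numeric(results$pct)

# Arithmetic now works, e.g. votes across these three rows.
total_votes <- sum(results$count)
print(total_votes)
```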

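Despite the ".xml" extension on the URL, it isn't XML data. I also don't know what some of the columns actually are, but you can dig into that. There's also another section of data: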

WI;S;G;0;Wisconsin;X;100.0;X;;50885;;||WI269201108;Dem;1380496;46.8;;X;;;1|WI267231108;GOP;1479262;50.2;X;X;X;;2|WI270541108;Lib;87291;3.0;;X;;;3
WI;S;G;55001;Adams;X;100.0;X;;50885;;||WI269201108;Dem;4093;41.2;;X;;;1|WI267231108;GOP;5346;53.9;X;X;X;;2|WI270541108;Lib;486;4.9;;X;;;3
WI;S;G;55003;Ashland;X;100.0;X;;50885;;||WI269201108;Dem;4349;55.1;;X;;;1|WI267231108;GOP;3337;42.2;X;X;X;;2|WI270541108;Lib;214;2.7;;X;;;3
WI;S;G;55005;Barron;X;100.0;X;;50885;;||WI269201108;Dem;8691;38.8;;X;;;1|WI267231108;GOP;12863;57.4;X;X;X;;2|WI270541108;Lib;853;3.8;;X;;;3
WI;S;G;55007;Bayfield;X;100.0;X;;50885;;||WI269201108;Dem;5161;54.6;;X;;;1|WI267231108;GOP;4022;42.6;X;X;X;;2|WI270541108;Lib;263;2.8;;X;;;3
WI;S;G;55009;Brown;X;100.0;X;;50885;;||WI269201108;Dem;51004;40.0;;X;;;1|WI267231108;GOP;71750;56.3;X;X;X;;2|WI270541108;Lib;4615;3.6;;X;;;3
WI;S;G;55011;Buffalo;X;100.0;X;;50885;;||WI269201108;Dem;2746;39.9;;X;;;1|WI267231108;GOP;3850;56.0;X;X;X;;2|WI270541108;Lib;285;4.1;;X;;;3
WI;S;G;55013;Burnett;X;100.0;X;;50885;;||WI269201108;Dem;3143;37.4;;X;;;1|WI267231108;GOP;4998;59.5;X;X;X;;2|WI270541108;Lib;258;3.1;;X;;;3

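That obviously corresponds to something on that page (it's fairly obvious, but I'm tired of the election after doing so much with the data), and you can process it in much the same way as the above.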
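
As a rough sketch of what "a similar way" could look like, the base-R snippet below parses the Adams line from that second section. The column names (rep_id, party, count, pct, fips, name, reporting) are borrowed from the code above and are guesses about what the fields mean; reading "S" as the Senate race is also an assumption.

```r
# One line copied from the second section of the data.
line <- "WI;S;G;55001;Adams;X;100.0;X;;50885;;||WI269201108;Dem;4093;41.2;;X;;;1|WI267231108;GOP;5346;53.9;X;X;X;;2|WI270541108;Lib;486;4.9;;X;;;3"

# "||" separates the county prefix from the per-candidate records.
parts  <- strsplit(line, "||", fixed = TRUE)[[1]]
prefix <- strsplit(parts[1], ";", fixed = TRUE)[[1]]

# Each candidate record is "|"-separated; its fields are ";"-separated.
records <- strsplit(strsplit(parts[2], "|", fixed = TRUE)[[1]], ";", fixed = TRUE)

senate <- do.call(rbind, lapply(records, function(f) {
  data.frame(rep_id = f[1], party = f[2],
             count = as.integer(f[3]), pct = as.numeric(f[4]),
             stringsAsFactors = FALSE)
}))

# Prefix fields 4, 5 and 7 line up with fips, name and reporting in the code above.
senate$fips      <- prefix[4]
senate$name      <- prefix[5]
senate$reporting <- prefix[7]

print(senate)
```

To handle the whole section, the same steps could be applied to every line of dat that starts with "WI;S;G" and the pieces combined with rbind().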