I'm trying to use rvest to scrape some election results from Politico's website.
http://www.politico.com/2016-election/results/map/president/wisconsin/
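I couldn't get all of the data on the page at once, so I took a county-by-county approach. Each county has a unique CSS selector (e.g. Adams County's is '#countyAdams .results-table'). So I grabbed all the county names from elsewhere and set up a quick loop (yes, I know loops are considered bad practice in R, but I expected this approach to take me about 3 minutes).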
Fetch the page:
library(rvest)
wiscoSixteen <- read_html("http://www.politico.com/2016-election/results/map/president/wisconsin")
Create an empty data.frame (I'm not pre-defining the columns):
stateDf <- NULL
Get the list of counties (this is not complete, but we don't need all 72 counties to reproduce the error):
wiscoCounties <- c("Adams", "Ashland", "Barron", "Bayfield", "Brown", "Buffalo", "Burnett", "Calumet", "Chippewa", "Clark", "Columbia", "Crawford", "Dane", "Dodge", "Door", "Douglas", "Dunn", "Eau Claire", "Florence", "Fond du Lac", "Forest", "Grant", "Green", "Green Lake", "Iowa", "Iron", "Jackson", "Jefferson", "Juneau")
My 'for' loop:
for (i in seq_along(wiscoCounties)) {
  # Pull out the i'th county name and paste it into the selector string
  wiscoResult <- wiscoSixteen %>%
    html_node(paste0("#county", wiscoCounties[i], " .results-table")) %>%
    html_table()
  # add a column for the county name so I can ID later
  wiscoResult[,4] <- wiscoCounties[i]
  # then rbind
  stateDf <- rbind(stateDf, wiscoResult)
}
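After it gets through the 10th county, it stops and returns 'Error: No matches'.
I can't find anything unique about the 11th county, 'Columbia'. I'm at a loss as to what's happening. I'm sure it's something silly, since that's usually the case. Any help is appreciated.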
Answer 0 (score: 3)
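Well, why not just use the XHR request that ultimately populates those tables (I'm a bit surprised you got any data out of the page at all, since the tables are generated from a separate data request):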
library(httr)
library(stringi)
library(purrr)
library(dplyr)

# Fetch the raw data feed behind the results tables
res <- GET("http://s3.amazonaws.com/origin-east-elections.politico.com/mapdata/2016/WI_20161108.xml")
dat <- readLines(textConnection(content(res, as="text")))

# The second line is parsed as the candidate list: "|"-separated records of ";"-separated fields.
# Note: judging by the output below, the surname lands in `first` and the given name in `last`.
stri_split_fixed(dat[2], "|")[[1]] %>%
  stri_replace_last_fixed(";", "") %>%
  stri_split_fixed(";", 3) %>%
  map_df(~setNames(as.list(.), c("rep_id", "first", "last"))) -> candidates

# Each line starting with "WI;P;G" is one area: metadata, then "||", then per-candidate records
dat[stri_detect_regex(dat, "^WI;P;G")] %>%
  stri_replace_first_regex("^WI;P;G;", "") %>%
  map_df(function(x) {

    county_results <- stri_split_fixed(x, "||", 2)[[1]]

    # Area metadata (the part before "||")
    stri_replace_last_fixed(county_results[1], ";;", "") %>%
      stri_split_fixed(";") %>%
      map_df(~setNames(as.list(.), c("fips", "name", "x1", "reporting", "x2", "x3", "x4"))) -> county_prefix

    # Per-candidate records (the part after "||"), joined to the candidate names
    stri_split_fixed(county_results[2], "|")[[1]] %>%
      stri_split_fixed(";") %>%
      map_df(~setNames(as.list(.), c("rep_id", "party", "count", "pct", "x5", "x6", "x7", "x8", "candidate_idx"))) %>%
      left_join(candidates, by="rep_id") -> df

    df$fips <- county_prefix$fips
    df$name <- county_prefix$name
    df$reporting <- county_prefix$reporting

    # Drop the unidentified "x*" columns
    select(df, -starts_with("x"))

  }) -> results
It seems to be the complete data:
glimpse(results)
## Observations: 511
## Variables: 10
## $ rep_id <chr> "WI270631108", "WI270621108", "WI270691108", "WI270711108", "WI270701108", "WI270731108", "WI270721108",...
## $ party <chr> "Dem", "GOP", "Lib", "CST", "ADP", "WW", "Grn", "Dem", "GOP", "Lib", "CST", "ADP", "WW", "Grn", "Dem", "...
## $ count <chr> "1382210", "1409467", "106442", "12179", "1561", "1781", "30980", "3780", "5983", "207", "44", "4", "9",...
## $ pct <chr> "46.9", "47.9", "3.6", "0.4", "0.1", "0.1", "1.1", "37.4", "59.2", "2.0", "0.4", "0.0", "0.1", "0.8", "5...
## $ candidate_idx <chr> "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7",...
## $ first <chr> "Clinton", "Trump", "Johnson", "Castle", "De La Fuente", "Moorehead", "Stein", "Clinton", "Trump", "John...
## $ last <chr> "Hillary", "Donald", "Gary", "Darrell", "Rocky", "Monica", "Jill", "Hillary", "Donald", "Gary", "Darrell...
## $ fips <chr> "0", "0", "0", "0", "0", "0", "0", "55001", "55001", "55001", "55001", "55001", "55001", "55001", "55003...
## $ name <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Adams", "Ada...
## $ reporting <chr> "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100.0", "100....
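Every column comes back as character, so you'll probably want to convert the numeric ones before doing any arithmetic. A minimal base-R sketch, using a small hand-built stand-in (`results_demo`, a hypothetical name) holding a few rows copied from the output above rather than the real `results`:

```r
# Hand-built stand-in for `results`: a subset of rows and columns,
# with values copied from the glimpse() output above
results_demo <- data.frame(
  party = c("Dem", "GOP", "Lib", "Dem", "GOP", "Lib"),
  count = c("1382210", "1409467", "106442", "3780", "5983", "207"),
  pct   = c("46.9", "47.9", "3.6", "37.4", "59.2", "2.0"),
  fips  = c("0", "0", "0", "55001", "55001", "55001"),
  name  = c(rep("Wisconsin", 3), rep("Adams", 3)),
  stringsAsFactors = FALSE
)

# Convert the vote counts and percentages from character to numeric types
results_demo$count <- as.integer(results_demo$count)
results_demo$pct   <- as.numeric(results_demo$pct)

# In the output above, fips "0" is the statewide "Wisconsin" row;
# drop it to keep only county-level rows
county_demo <- results_demo[results_demo$fips != "0", ]
```

The same two conversions should apply to the full `results` data frame.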
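Despite the ".xml" extension on the URL, it's not XML data. I also don't know what some of the columns actually are, but you can dig into that. Also, there's another section of data: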
WI;S;G;0;Wisconsin;X;100.0;X;;50885;;||WI269201108;Dem;1380496;46.8;;X;;;1|WI267231108;GOP;1479262;50.2;X;X;X;;2|WI270541108;Lib;87291;3.0;;X;;;3
WI;S;G;55001;Adams;X;100.0;X;;50885;;||WI269201108;Dem;4093;41.2;;X;;;1|WI267231108;GOP;5346;53.9;X;X;X;;2|WI270541108;Lib;486;4.9;;X;;;3
WI;S;G;55003;Ashland;X;100.0;X;;50885;;||WI269201108;Dem;4349;55.1;;X;;;1|WI267231108;GOP;3337;42.2;X;X;X;;2|WI270541108;Lib;214;2.7;;X;;;3
WI;S;G;55005;Barron;X;100.0;X;;50885;;||WI269201108;Dem;8691;38.8;;X;;;1|WI267231108;GOP;12863;57.4;X;X;X;;2|WI270541108;Lib;853;3.8;;X;;;3
WI;S;G;55007;Bayfield;X;100.0;X;;50885;;||WI269201108;Dem;5161;54.6;;X;;;1|WI267231108;GOP;4022;42.6;X;X;X;;2|WI270541108;Lib;263;2.8;;X;;;3
WI;S;G;55009;Brown;X;100.0;X;;50885;;||WI269201108;Dem;51004;40.0;;X;;;1|WI267231108;GOP;71750;56.3;X;X;X;;2|WI270541108;Lib;4615;3.6;;X;;;3
WI;S;G;55011;Buffalo;X;100.0;X;;50885;;||WI269201108;Dem;2746;39.9;;X;;;1|WI267231108;GOP;3850;56.0;X;X;X;;2|WI270541108;Lib;285;4.1;;X;;;3
WI;S;G;55013;Burnett;X;100.0;X;;50885;;||WI269201108;Dem;3143;37.4;;X;;;1|WI267231108;GOP;4998;59.5;X;X;X;;2|WI270541108;Lib;258;3.1;;X;;;3
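This is obviously something else used on that page as well (it's probably obvious what, but I'm tired of elections and I do a great deal with this kind of data), and you can process it in a similar way to what's above.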
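As a rough sketch of that, here is the same split-on-delimiters idea applied to the first two "WI;S;G" lines above. It uses base R string functions instead of stringi to stay dependency-free, `s_lines` and `parse_s_line` are names I made up, and the field positions are assumed to match the ones used for the "WI;P;G" rows:

```r
# Two lines copied verbatim from the second section of the feed shown above
s_lines <- c(
  "WI;S;G;0;Wisconsin;X;100.0;X;;50885;;||WI269201108;Dem;1380496;46.8;;X;;;1|WI267231108;GOP;1479262;50.2;X;X;X;;2|WI270541108;Lib;87291;3.0;;X;;;3",
  "WI;S;G;55001;Adams;X;100.0;X;;50885;;||WI269201108;Dem;4093;41.2;;X;;;1|WI267231108;GOP;5346;53.9;X;X;X;;2|WI270541108;Lib;486;4.9;;X;;;3"
)

# Hypothetical helper: parse one "WI;S;G" line into one row per candidate
parse_s_line <- function(x) {
  # Drop the section prefix, then split area metadata from the candidate block
  x <- sub("^WI;S;G;", "", x)
  halves <- strsplit(x, "||", fixed = TRUE)[[1]]

  # Area metadata: assumed positions 1 = fips, 2 = name, 4 = percent reporting
  prefix <- strsplit(halves[1], ";", fixed = TRUE)[[1]]

  # Candidate records are "|"-separated; fields within each are ";"-separated
  cands <- strsplit(strsplit(halves[2], "|", fixed = TRUE)[[1]], ";", fixed = TRUE)
  field <- function(i) vapply(cands, `[`, character(1), i)

  data.frame(
    rep_id    = field(1),
    party     = field(2),
    count     = as.integer(field(3)),
    pct       = as.numeric(field(4)),
    fips      = prefix[1],
    name      = prefix[2],
    reporting = as.numeric(prefix[4]),
    stringsAsFactors = FALSE
  )
}

s_results <- do.call(rbind, lapply(s_lines, parse_s_line))
```

For the real feed you would pick the lines out with something like `dat[grepl("^WI;S;G", dat)]` and pass those in instead of `s_lines`.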