I am trying to scrape a Wikipedia table with this code:
library(tidyverse)
library(rvest)
my_url <- "https://en.wikipedia.org/wiki/List_of_Australian_Open_men%27s_singles_champions"
mytable <- read_html(my_url) %>% html_nodes("table") %>% .[[4]]
mytable <- mytable %>% html_table()
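The problem is that in the returned table the values in the two name columns (Champion and Runner-up) are doubled. Well, not exactly doubled: each cell seems to contain the name in two forms, once as "surname, name" with a comma and once as "name surname". The original wiki page does not look like this; only "name surname" is visible there. Why does this happen, and how can I get rid of it? I need these columns to contain only "name surname".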
head(mytable)
Year[f] Country Champion Country Runner-up Score in the final[4][14]
1 1969 AUS Laver, RodRod Laver[b] ESP Gimeno, AndrésAndrés Gimeno 6–3, 6–4, 7–5
2 1970 USA Ashe, ArthurArthur Ashe AUS Crealy, DickDick Crealy 6–4, 9–7, 6–2
3 1971 AUS Rosewall, KenKen Rosewall USA Ashe, ArthurArthur Ashe 6–1, 7–5, 6–3
4 1972 AUS Rosewall, KenKen Rosewall AUS Anderson, MalcolmMalcolm Anderson 7–6(7–2), 6–3, 7–5
5 1973 AUS Newcombe, JohnJohn Newcombe NZL Parun, OnnyOnny Parun 6–3, 6–7, 7–5, 6–1
6 1974 USA Connors, JimmyJimmy Connors AUS Dent, PhilPhil Dent 7–6(9–7), 6–4, 4–6, 6–3
Answer 0 (score: 1)
The htmltab package can be used to scrape these Wiki tables.
library(htmltab)
# Data cleaning step: take each cell's text, re-encode it for Windows,
# and drop everything from a whitespace followed by "<", "†" or "‡" onwards
bFun <- function(node) {
  x <- XML::xmlValue(node)
  gsub("\\s[<†‡].*$", "", iconv(x, from = "UTF-8", to = "Windows-1252", sub = "byte"))
}
df1 <- htmltab(doc = "https://en.wikipedia.org/wiki/List_of_Australian_Open_men%27s_singles_champions",
               which = 4,
               rm_superscript = FALSE,
               bodyFun = bFun) # bodyFun is not required if you run the code on a Mac
head(df1)
gives
# Year[f] Country Champion Country Runner-up Score in the final[4][14]
#2 1969 AUS Rod Laver[b] ESP Andrés Gimeno 6–3, 6–4, 7–5
#3 1970 USA Arthur Ashe AUS Dick Crealy 6–4, 9–7, 6–2
#4 1971 AUS Ken Rosewall USA Arthur Ashe 6–1, 7–5, 6–3
#5 1972 AUS Ken Rosewall AUS Malcolm Anderson 7–6(7–2), 6–3, 7–5
#6 1973 AUS John Newcombe NZL Onny Parun 6–3, 6–7, 7–5, 6–1
#7 1974 USA Jimmy Connors AUS Phil Dent 7–6(9–7), 6–4, 4–6, 6–3
and
df2 <- htmltab(doc = "https://en.wikipedia.org/wiki/List_of_Wimbledon_gentlemen%27s_singles_champions",
               which = 3,
               rm_superscript = FALSE,
               bodyFun = bFun) # bodyFun is not required if you run the code on a Mac
head(df2)
gives
# Year[d] Country Champion Country Runner-up Score in the final[4]
#2 1877 BRI[e] Spencer Gore BRI William Marshall 6–1, 6–2, 6–4
#3 1878 BRI Frank Hadow BRI Spencer Gore 7–5, 6–1, 9–7
#4 1879 BRI John Hartley BRI Vere St. Leger Goold 6–2, 6–4, 6–2
#5 1880 BRI John Hartley BRI Herbert Lawford 6–3, 6–2, 2–6, 6–3
#6 1881 BRI William Renshaw BRI John Hartley 6–0, 6–1, 6–1
#7 1882 BRI William Renshaw BRI Ernest Renshaw 6–1, 2–6, 4–6, 6–2, 6–2
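
As for why the names come out doubled with rvest: my understanding, and this is an assumption about the page's markup rather than something verified here, is that each name cell carries a hidden sort key such as "Laver, Rod" in a span styled display:none, so the table sorts by surname. A browser hides that span, but html_table() collects all the text in a cell regardless of CSS. If that is the cause, a rough rvest-only sketch is to delete the hidden nodes before converting the table. The HTML string below is a made-up miniature of one row, not the real page.

```r
library(rvest)
library(xml2)

# Hypothetical one-row table; the hidden sort-key span is an assumption
# about how the wiki page marks up player names.
html <- '<table>
  <tr><th>Year</th><th>Champion</th></tr>
  <tr><td>1969</td><td><span style="display:none">Laver, Rod</span><span class="vcard"><a href="#">Rod Laver</a></span></td></tr>
</table>'

page <- read_html(html)

# Find every span hidden with an inline display:none style and remove it
# from the document in place.
hidden <- xml_find_all(page, "//span[contains(@style, 'display:none')]")
xml_remove(hidden)

# Convert the cleaned table as usual.
clean <- page %>% html_node("table") %>% html_table()
print(clean$Champion)
```

With the real page you would call read_html(my_url), run the same xml_remove() step, and then pick the table with .[[4]] as in the question. If the page marks the sort key differently (for example with a class instead of an inline style), the XPath would need adjusting.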