html_table doubles the values in a column

Time: 2018-06-08 06:07:31

Tags: r web-scraping

I am trying to scrape a wiki table with this code:

library(tidyverse)
library(rvest)
my_url <- "https://en.wikipedia.org/wiki/List_of_Australian_Open_men%27s_singles_champions"
mytable <- read_html(my_url) %>% html_nodes("table") %>% .[[4]] 
mytable <- mytable %>% html_table()

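The problem is that in the returned table the values in the two name columns (Champion and Runner-up) are doubled. Well, not exactly doubled: each name appears in two forms run together, once as "surname, name" with a comma and once as "name surname". The original wiki page does not look like that; only "name surname" is visible there. Why does this happen and how do I get rid of it? I need these columns to contain only 'name surname'.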

head(mytable)
  Year[f] Country                    Champion Country                         Runner-up Score in the final[4][14]
1    1969     AUS      Laver, RodRod Laver[b]     ESP       Gimeno, AndrésAndrés Gimeno             6–3, 6–4, 7–5
2    1970     USA     Ashe, ArthurArthur Ashe     AUS           Crealy, DickDick Crealy             6–4, 9–7, 6–2
3    1971     AUS   Rosewall, KenKen Rosewall     USA           Ashe, ArthurArthur Ashe             6–1, 7–5, 6–3
4    1972     AUS   Rosewall, KenKen Rosewall     AUS Anderson, MalcolmMalcolm Anderson        7–6(7–2), 6–3, 7–5
5    1973     AUS Newcombe, JohnJohn Newcombe     NZL             Parun, OnnyOnny Parun        6–3, 6–7, 7–5, 6–1
6    1974     USA Connors, JimmyJimmy Connors     AUS               Dent, PhilPhil Dent   7–6(9–7), 6–4, 4–6, 6–3

1 answer:

Answer 0 (score: 1)

htmltab can be used to scrape these Wiki tables.

library(htmltab)

# data cleaning: re-encode the cell text and strip trailing footnote markers
bFun <- function(node) {
  x <- XML::xmlValue(node)
  gsub("\\s[<†‡].*$", "", iconv(x, from = 'UTF-8', to = "Windows-1252", sub="byte"))
}

df1 <- htmltab(doc = "https://en.wikipedia.org/wiki/List_of_Australian_Open_men%27s_singles_champions", 
              which = 4,
              rm_superscript = FALSE,
              bodyFun = bFun)         # bodyFun is not required if you are running the code on a Mac
head(df1)

which gives

#  Year[f] Country      Champion Country        Runner-up Score in the final[4][14]
#2    1969     AUS  Rod Laver[b]     ESP    Andrés Gimeno             6–3, 6–4, 7–5
#3    1970     USA   Arthur Ashe     AUS      Dick Crealy             6–4, 9–7, 6–2
#4    1971     AUS  Ken Rosewall     USA      Arthur Ashe             6–1, 7–5, 6–3
#5    1972     AUS  Ken Rosewall     AUS Malcolm Anderson        7–6(7–2), 6–3, 7–5
#6    1973     AUS John Newcombe     NZL       Onny Parun        6–3, 6–7, 7–5, 6–1
#7    1974     USA Jimmy Connors     AUS        Phil Dent   7–6(9–7), 6–4, 4–6, 6–3

df2 <- htmltab(doc = "https://en.wikipedia.org/wiki/List_of_Wimbledon_gentlemen%27s_singles_champions", 
              which = 3,
              rm_superscript = FALSE,
              bodyFun = bFun)         # bodyFun is not required if you are running the code on a Mac
head(df2)
head(df2)

which gives

#  Year[d] Country        Champion Country            Runner-up   Score in the final[4]
#2    1877  BRI[e]    Spencer Gore     BRI     William Marshall           6–1, 6–2, 6–4
#3    1878     BRI     Frank Hadow     BRI         Spencer Gore           7–5, 6–1, 9–7
#4    1879     BRI    John Hartley     BRI Vere St. Leger Goold           6–2, 6–4, 6–2
#5    1880     BRI    John Hartley     BRI      Herbert Lawford      6–3, 6–2, 2–6, 6–3
#6    1881     BRI William Renshaw     BRI         John Hartley           6–0, 6–1, 6–1
#7    1882     BRI William Renshaw     BRI       Ernest Renshaw 6–1, 2–6, 4–6, 6–2, 6–2
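
As for why the names are doubled: a likely explanation is that Wikipedia's sortable tables carry a hidden sort key ("surname, name") in a span styled with display:none, placed next to the visible name. html_table() returns all the text in a cell, hidden or not. If that is what is happening here, a possible way to stay with rvest is to delete the hidden spans before parsing the table. The sketch below uses a small inline HTML snippet as a stand-in for the real page; the markup and the selector span[style*='display:none'] are assumptions about how the page is built, so inspect the actual source and adjust the selector if needed.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the Wikipedia markup: each name cell holds a
# hidden sort key followed by the visible name.
html <- '<table>
  <tr><th>Year</th><th>Champion</th></tr>
  <tr><td>1969</td><td><span style="display:none">Laver, Rod</span><a href="#">Rod Laver</a></td></tr>
  <tr><td>1970</td><td><span style="display:none">Ashe, Arthur</span><a href="#">Arthur Ashe</a></td></tr>
</table>'

page <- read_html(html)
tbl_node <- html_nodes(page, "table")[[1]]

# Find the hidden spans inside the table and remove them from the document.
# xml_remove() modifies the parsed document in place.
hidden <- html_nodes(tbl_node, "span[style*='display:none']")
xml_remove(hidden)

# With the hidden sort keys gone, only the visible names remain.
cleaned <- html_table(tbl_node)
print(cleaned)
```

On the real page the same idea would apply to the node selected with .[[4]] in the question, before the html_table() call.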