Extracting date information from a Wikipedia table in R

Time: 2014-06-15 18:14:52

Tags: r date

I used the XML package to scrape the following table from Wikipedia:

http://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads

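You will notice that the dob variable on the web page looks like this: 4 January 1985 (aged 29)

It reads into my R data frame as follows: (1985-01-04)4 January 1985 (aged 29)

In the scraped data, R treats this column as a factor rather than a date.

I am trying to create a variable that holds only the dob in YYYY-MM-DD format, but I have not been able to reformat the 'dob' variable.

I tried the following without success (my data frame is called alpha):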

alpha$newvar <- as.Date(alpha$dob, "%Y%m%d")
alpha$newvar <- strptime(alpha$dob,format="%Y%m%d")

Here is sample data for the South Korea squad:

structure(list(no = structure(c(1L, 12L, 17L, 18L, 19L, 20L, 
21L, 22L, 23L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 13L, 
14L, 15L, 16L), .Label = c("1", "10", "11", "12", "13", "14", 
"15", "16", "17", "18", "19", "2", "20", "21", "22", "23", "3", 
"4", "5", "6", "7", "8", "9"), class = "factor"), pos = structure(c(1L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 2L, 3L, 3L, 3L, 3L, 3L, 
4L, 4L, 2L, 1L, 2L, 1L), .Label = c("1GK", "2DF", "3MF", "4FW"
), class = "factor"), player = structure(c(6L, 9L, 23L, 14L, 
12L, 4L, 8L, 1L, 22L, 19L, 17L, 18L, 13L, 2L, 20L, 7L, 16L, 11L, 
5L, 3L, 10L, 21L, 15L), .Label = c("Ha Dae-sung", "Han Kook-young", 
"Hong Jeong-ho", "Hwang Seok-ho", "Ji Dong-won", "Jung Sung-ryong", 
"Ki Sung-yueng", "Kim Bo-kyung", "Kim Chang-soo", "Kim Seung-gyu", 
"Kim Shin-wook", "Kim Young-gwon", "Koo Ja-cheol (c)", "Kwak Tae-hwi", 
"Lee Bum-young", "Lee Chung-yong", "Lee Keun-ho", "Lee Yong", 
"Park Chu-young", "Park Jong-woo", "Park Joo-ho[67]", "Son Heung-min", 
"Yun Suk-young"), class = "factor"), dob = structure(c(2L, 6L, 
18L, 1L, 19L, 15L, 17L, 3L, 23L, 5L, 4L, 7L, 12L, 20L, 13L, 11L, 
10L, 9L, 22L, 16L, 21L, 8L, 14L), .Label = c("(1981-07-08)8 July 1981 (aged 32)", 
"(1985-01-04)4 January 1985 (aged 29)", "(1985-03-02)2 March 1985 (aged 29)", 
"(1985-04-11)11 April 1985 (aged 29)", "(1985-07-10)10 July 1985 (aged 28)", 
"(1985-09-12)12 September 1985 (aged 28)", "(1986-12-24)24 December 1986 (aged 27)", 
"(1987-01-16)16 January 1987 (aged 27)", "(1988-04-14)14 April 1988 (aged 26)", 
"(1988-07-02)2 July 1988 (aged 25)", "(1989-01-24)24 January 1989 (aged 25)", 
"(1989-02-27)27 February 1989 (aged 25)", "(1989-03-10)10 March 1989 (aged 25)", 
"(1989-04-02)2 April 1989 (aged 25)", "(1989-06-27)27 June 1989 (aged 24)", 
"(1989-08-12)12 August 1989 (aged 24)", "(1989-10-06)6 October 1989 (aged 24)", 
"(1990-02-13)13 February 1990 (aged 24)", "(1990-02-27)27 February 1990 (aged 24)", 
"(1990-04-19)19 April 1990 (aged 24)", "(1990-09-30)30 September 1990 (aged 23)", 
"(1991-05-28)28 May 1991 (aged 23)", "(1992-07-08)8 July 1992 (aged 21)"
 ), class = "factor"), caps = structure(c(17L, 20L, 13L, 11L, 
6L, 10L, 9L, 4L, 7L, 19L, 18L, 3L, 12L, 2L, 2L, 16L, 15L, 8L, 
9L, 7L, 14L, 5L, 1L), .Label = c("0", "10", "12", "13", "14", 
"21", "25", "27", "28", "3", "35", "37", "4", "5", "55", "58", 
"61", "63", "64", "9"), class = "factor"), club = structure(c(16L, 
10L, 12L, 1L, 8L, 13L, 6L, 3L, 2L, 18L, 14L, 17L, 11L, 10L, 9L, 
15L, 4L, 17L, 7L, 7L, 17L, 11L, 5L), .Label = c("Al-Hilal", "Bayer Leverkusen", 
"Beijing Guoan", "Bolton Wanderers", "Busan IPark", "Cardiff City", 
"FC Augsburg", "Guangzhou Evergrande", "Guangzhou R&F", "Kashiwa Reysol", 
"Mainz 05", "Queens Park Rangers", "Sanfrecce Hiroshima", "Sangju Sangmu", 
"Sunderland", "Suwon Bluewings", "Ulsan Hyundai", "Watford"), class = "factor")), .Names = c("no", 
"pos", "player", "dob", "caps", "club"), row.names = c(NA, -23L
), class = "data.frame")

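1 answer:

Answer 0: (score: 0)

I can answer my own question. The problem is that, to tell R the date format correctly, the format has to state that the date is enclosed in parentheses.

So,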

as.character(strptime(alpha$dob, format = "(%Y-%m-%d)"))

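Passing the format "(%Y-%m-%d)" tells R to look for the date inside the parentheses at the start of the string; the text that follows the closing parenthesis is ignored.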
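
As a minimal sketch of the same idea, the snippet below builds a small stand-in for the alpha data frame from three of the dob values in the sample data (the cut-down data frame and the column names newvar and newvar2 are assumptions made for illustration). It stores the result as a Date column with as.Date(), and also shows one possible alternative that extracts the ISO date with a regular expression before converting.

```r
# Hypothetical cut-down version of `alpha`, using three dob values from the sample data
alpha <- data.frame(
  dob = factor(c("(1985-01-04)4 January 1985 (aged 29)",
                 "(1985-09-12)12 September 1985 (aged 28)",
                 "(1992-07-08)8 July 1992 (aged 21)"))
)

# as.Date() converts the factor to character; parsing stops once the
# format is matched, so the trailing "4 January 1985 (aged 29)" is ignored
alpha$newvar <- as.Date(alpha$dob, format = "(%Y-%m-%d)")

# Alternative (an assumption, not part of the original answer):
# extract the ISO date with a regular expression, then convert it
iso <- sub("^\\(([0-9]{4}-[0-9]{2}-[0-9]{2})\\).*$", "\\1", as.character(alpha$dob))
alpha$newvar2 <- as.Date(iso)

str(alpha)
```

Keeping the column as class Date rather than character means it sorts chronologically and can be used in date arithmetic; format(alpha$newvar) gives the YYYY-MM-DD strings back when text is needed.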