Fixing date data scraped from Wikipedia using R

Time: 2015-08-06 07:46:02

Tags: r

I am using R to scrape data from Wikipedia (working example):

library(reshape)
library(RCurl)
library(XML)
theurl <- getURL("https://en.wikipedia.org/wiki/Opinion_polling_for_the_42nd_Canadian_federal_election", ssl.verifyPeer=FALSE)
tables <- readHTMLTable(theurl)
raw_polling_data <- tables[[2]]

But the date column comes out in a funky format, with a run of leading 0's in front of every value:

           Polling Firm      Last Date\nof Polling                Link Cons.
1        Nanos Research      000000002015-07-31-0000July 31, 2015  PDF  31.5
2   Innovative Research      000000002015-07-30-0000July 30, 2015 HTML  29.3
3        Forum Research      000000002015-07-28-0000July 28, 2015  PDF    33
4                  EKOS      000000002015-07-28-0000July 28, 2015  PDF  30.1
5            Ipsos Reid      000000002015-07-27-0000July 27, 2015 HTML    33
6   Mainstreet Research      000000002015-07-21-0000July 21, 2015 HTML    38
7        Forum Research      000000002015-07-20-0000July 20, 2015  PDF    28
...

How can I convert these dates to yyyy-mm-dd inside the table, so that it looks like this:

           Polling Firm      Date...    Link Cons.
1        Nanos Research      2015-07-31  PDF  31.5
2   Innovative Research      2015-07-30 HTML  29.3
...

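2 Answers:

Answer 0 (score: 3)

Assuming the number of leading 0's is always the same (i.e. 8), the yyyy-mm-dd part sits at characters 9 to 18 of each value: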

cleanDate <- as.Date(substr(raw_polling_data[, 2], 9, 18))

Check:

head(cleanDate)
[1] "2015-07-31" "2015-07-30" "2015-07-28" "2015-07-28" "2015-07-27" "2015-07-21"
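
If the number of leading 0's cannot be relied on, a regular expression can skip however many there are and keep only the yyyy-mm-dd part. The following is a minimal sketch rather than part of the original answer: the `polls` data frame and its column names (`firm`, `last_date`) are made up for illustration, and the sample strings are copied from the output in the question. It also writes the result back into the table, which is what the question asks for.

```r
# Sample values copied from the question's output; in the real script this
# would be as.character(raw_polling_data[, 2]).
raw_dates <- c(
  "000000002015-07-31-0000July 31, 2015",
  "000000002015-07-30-0000July 30, 2015",
  "000000002015-07-28-0000July 28, 2015"
)

# Hypothetical stand-in for the scraped table (column names are assumptions).
polls <- data.frame(
  firm = c("Nanos Research", "Innovative Research", "Forum Research"),
  last_date = raw_dates,
  stringsAsFactors = FALSE
)

# Skip any number of leading zeros, capture the yyyy-mm-dd part, drop the rest.
iso_part <- sub("^0*([0-9]{4}-[0-9]{2}-[0-9]{2}).*$", "\\1", polls$last_date)

# With an explicit format, values that do not parse as yyyy-mm-dd become NA
# instead of raising an error.
polls$last_date <- as.Date(iso_part, format = "%Y-%m-%d")

print(polls)
```

Applied to the real data, the same two lines would be run on `raw_polling_data[, 2]` (wrapped in `as.character()`, since `readHTMLTable` may return the column as a factor) and assigned back to that column.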

Answer 1 (score: 1)

This is an encoding problem. Try htmltab. For now, use the GitHub version:

devtools::install_github("crubba/htmltab")
library("htmltab")
htmltab("https://en.wikipedia.org/wiki/Opinion_polling_for_the_42nd_Canadian_federal_election", which = 2)