I am scraping data from Wikipedia using R (working example):
library(reshape)
library(RCurl)
library(XML)
theurl <- getURL("https://en.wikipedia.org/wiki/Opinion_polling_for_the_42nd_Canadian_federal_election", ssl.verifyPeer=FALSE)
tables <- readHTMLTable(theurl)
raw_polling_data <- tables[[2]]
But the date column comes out in a funky format, with a run of leading 0's in front of every value:
Polling Firm Last Date\nof Polling Link Cons.
1 Nanos Research 000000002015-07-31-0000July 31, 2015 PDF 31.5
2 Innovative Research 000000002015-07-30-0000July 30, 2015 HTML 29.3
3 Forum Research 000000002015-07-28-0000July 28, 2015 PDF 33
4 EKOS 000000002015-07-28-0000July 28, 2015 PDF 30.1
5 Ipsos Reid 000000002015-07-27-0000July 27, 2015 HTML 33
6 Mainstreet Research 000000002015-07-21-0000July 21, 2015 HTML 38
7 Forum Research 000000002015-07-20-0000July 20, 2015 PDF 28
...
How can I convert these dates to yyyy-mm-dd so that the table looks like this:
Polling Firm Date... Link Cons.
1 Nanos Research 2015-07-31 PDF 31.5
2 Innovative Research 2015-07-30 HTML 29.3
...
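Answer 0 (score: 3)
Assuming the number of leading 0's is always the same (i.e. 8), you can extract the ISO date by position, taking characters 9 through 18 of each value: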
cleanDate <- as.Date(substr(raw_polling_data[, 2], 9, 18))
Check:
head(cleanDate)
[1] "2015-07-31" "2015-07-30" "2015-07-28" "2015-07-28" "2015-07-27" "2015-07-21"
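If the number of leading 0's might vary, a pattern-based extraction is one possible alternative to fixed positions. The sketch below is not part of the original answer: the vector `raw_dates` is a hypothetical stand-in built from the sample rows in the question, and the regular expression assumes every value contains exactly one yyyy-mm-dd date right after the zero padding.

```r
# Hypothetical sample values copied from the question's output;
# in the real data these would come from raw_polling_data[, 2].
raw_dates <- c(
  "000000002015-07-31-0000July 31, 2015",
  "000000002015-07-30-0000July 30, 2015",
  "000000002015-07-28-0000July 28, 2015"
)

# Skip any number of leading zeros, capture the yyyy-mm-dd part,
# and discard everything after it.
iso_part <- sub("^0*([0-9]{4}-[0-9]{2}-[0-9]{2}).*$", "\\1", raw_dates)

# Convert the extracted strings to Date objects.
clean_dates <- as.Date(iso_part, format = "%Y-%m-%d")

print(clean_dates)
```

The result can then be assigned back to the date column, for example `raw_polling_data[, 2] <- cleanDate` when using the vector from the answer above.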
Answer 1 (score: 1)
This is an encoding issue. Try the htmltab package. For now, use the GitHub version:
devtools::install_github("crubba/htmltab")
library("htmltab")
htmltab("https://en.wikipedia.org/wiki/Opinion_polling_for_the_42nd_Canadian_federal_election", which = 2)