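I have a dataset with several variables. One of them contains a text description that includes a year and a month. I want to extract the month and year from that variable, but I have not been able to.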
Sample_Data
var1 var2
203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010
206 2008 MARCH MONTH BROKERAGE
207 UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL
204 BROKERAGE FOR THE MONTH OF MARCH 2008
Expected_output:
var1 var2 month year
203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
206 2008 MARCH MONTH BROKERAGE MARCH 2008
207 UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL APRIL 2009
204 BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
Tried:
library(lubridate)
Sample_Data$month = month(Sample_Data$var2)
Sample_Data$year = year(Sample_Data$var2)
I have tried different approaches, for example lubridate and POSIXlt, but could not find a solution. Please help me with this.
Answer 0 (score: 2)
We can use extract from tidyr, specifying a regular expression that matches the characters shown in the input dataset.
library(tidyr)
extract(df1, var2, into=c('month', 'year'), '.*\\s+([A-Z]+)\\s+(\\d+)$',
remove=FALSE, convert=TRUE)
# var1 var2 month year
#1 203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#2 205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
#3 206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#4 207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
#5 204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
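Or, using base R, we remove the leading part of each string in 'var2' with sub: the pattern captures a word (\\w+) followed by whitespace (\\s+) followed by digits (\\d+) up to the end of the string, and the replacement is the capture group (\\1). We then read the result with read.table to create the new columns in 'df1'.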
df1[c('month', 'year')] <- read.table(text=sub('.*(\\b\\w+\\s+\\d+)$',
'\\1', df1$var2), stringsAsFactors=FALSE)
df1
# var1 var2 month year
#1 203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#2 205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
#3 206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#4 207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
#5 204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
NOTE: In both approaches, the new columns are converted to their respective class.
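Both patterns above assume the month and year are the last two words of var2. The sample data in the question also contains rows such as "2008 MARCH MONTH BROKERAGE", where that does not hold. Below is a minimal base R sketch that does not depend on word order. It assumes months are spelled out in full in uppercase English and that years are four digits starting with 19 or 20. The helper name extract_first is hypothetical and is not part of any package.

```r
# Sample rows taken from the question
df1 <- data.frame(
  var1 = c(203, 205, 206, 207, 204),
  var2 = c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
           "UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010",
           "2008 MARCH MONTH BROKERAGE",
           "UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL",
           "BROKERAGE FOR THE MONTH OF MARCH 2008"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first match of `pattern` in each string, NA if none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

# Alternation of the twelve English month names: "JANUARY|FEBRUARY|..."
month_pattern <- paste0("\\b(", paste(toupper(month.name), collapse = "|"), ")\\b")

df1$month <- extract_first(df1$var2, month_pattern)
# Assumption: a year is a standalone four-digit number starting with 19 or 20
df1$year <- as.integer(extract_first(df1$var2, "\\b(19|20)\\d{2}\\b"))
df1
```

Rows with no recognizable month or year get NA instead of a wrong value, which makes unexpected descriptions easy to spot.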
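Answer 1 (score: 0)
You can't treat it as a date yet, because you need to parse the string first. Try t(sapply(strsplit(Sample_Data$var2," "),function(x) x[7:8])) to get the two columns you need. Note that this assumes the month and year are always the 7th and 8th words.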
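Once the month name and year have been parsed out as strings, they can be combined into a real Date if date arithmetic is needed. Below is a minimal sketch, assuming full uppercase English month names and using the first day of the month as a placeholder day. The vectors month and year are illustrative stand-ins for the extracted columns.

```r
# Illustrative stand-ins for the extracted columns
month <- c("MARCH", "MAY", "APRIL")
year  <- c(2008L, 2010L, 2009L)

# Look up the month number in R's built-in English month names;
# this avoids locale-dependent parsing with "%B"
month_num <- match(month, toupper(month.name))

# Build ISO-formatted dates; the day is fixed to 1 because
# the source text carries no day information
dates <- as.Date(sprintf("%04d-%02d-01", year, month_num))
dates
format(dates, "%Y-%m")
```

With a Date vector in hand, functions such as lubridate's month() and year() from the question can be applied to it.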
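Answer 2 (score: 0)
You don't need lubridate, because you aren't actually working with a Date data type. Use strsplit from base R to split var2 into "words". It looks like the month is always the second-to-last word and the year is the last word.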
# reproducible example please!
d <- read.table(textConnection("
var1, var2
203, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
205, UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010
206, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
207, UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009
204, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
"), header=TRUE, sep=",", stringsAsFactors=FALSE)
get_month <- function(s) {
words <- unlist(strsplit(s, " "))
words[length(words)-1]
}
get_year <- function(s) {
words <- unlist(strsplit(s, " "))
as.integer(words[length(words)])
}
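d$month <- sapply(d$var2, get_month, USE.NAMES = FALSE)
# sapply (not lapply) so that year is an integer vector rather than a list column
d$year <- sapply(d$var2, get_year, USE.NAMES = FALSE)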
d
which produces the desired output:
> d
var1 var2 month year
1 203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
2 205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
3 206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
4 207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
5 204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008