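How do I extract the month and year from a descriptive variable in R?

Asked: 2015-09-02 13:22:02

Tags: r

I have a dataset with several variables. One variable contains a text description that includes a year and a month, and I want to extract the month and year from it, but I have not been able to.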

Sample_Data

var1   var2 
203    UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008                           
205    UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010                           
206    2008 MARCH MONTH BROKERAGE                            
207    UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL                           
204    BROKERAGE FOR THE MONTH OF MARCH 2008                           


Expected_output:

var1   var2                                            month   year     
203    UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008   MARCH   2008                      
205    UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010     MAY     2010                           
206    2008 MARCH MONTH BROKERAGE                      MARCH   2008                      
207    UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL       APRIL   2009                           
204    BROKERAGE FOR THE MONTH OF MARCH 2008           MARCH   2008

Tried:
library(lubridate)
Sample_Data$month = month(Sample_Data$var2)
Sample_Data$year = year(Sample_Data$var2)

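I have tried different approaches, for example lubridate and POSIXlt, but could not find a solution. Please help me with this.

3 answers:

Answer 0 (score: 2)

We can use extract from tidyr, specifying a regular expression that matches the pattern shown in the input dataset.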

library(tidyr)
extract(df1, var2, into=c('month', 'year'), '.*\\s+([A-Z]+)\\s+(\\d+)$', 
             remove=FALSE, convert=TRUE)
#  var1                                          var2 month year
#1  203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#2  205   UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010   MAY 2010
#3  206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#4  207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
#5  204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008

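Alternatively, using base R, we remove the substring at the start of each string in 'var2' with sub: the pattern captures a word (\\w+) followed by whitespace (\\s+) followed by digits (\\d+) up to the end of the string, and the replacement is the capture group (\\1). We then read the result with read.table to create the new columns in 'df1'.

df1[c('month', 'year')] <- read.table(text=sub('.*(\\b\\w+\\s+\\d+)$', '\\1', df1$var2),
             stringsAsFactors=FALSE)
df1
#  var1                                          var2 month year
#1  203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#2  205   UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010   MAY 2010
#3  206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#4  207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
#5  204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008

Note: In both approaches, the new columns are converted to their respective class.

Data: df1 is the five-row data frame with columns var1 and var2 printed in the output above.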
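For anyone who wants to run the base R approach, here is a minimal sketch. The definition of df1 is an assumption reconstructed from the printed output in this answer, not the answerer's original data block, and tail_pair is just an illustrative helper name. It also checks the point made in the note, namely that month ends up as character and year as integer.

```r
# Assumed reconstruction of df1 from the printed output (not the original data block)
df1 <- data.frame(
  var1 = c(203L, 205L, 206L, 207L, 204L),
  var2 = c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
           "UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010",
           "UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
           "UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009",
           "UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008"),
  stringsAsFactors = FALSE
)

# Keep only the trailing "<word> <digits>" pair of each description
tail_pair <- sub('.*(\\b\\w+\\s+\\d+)$', '\\1', df1$var2)

# read.table splits each pair on whitespace and guesses the column types
df1[c('month', 'year')] <- read.table(text = tail_pair, stringsAsFactors = FALSE)

# Inspect the resulting column classes
sapply(df1, class)
```

Note that this only works when the month and year are the last two words of the description, as they are in this df1.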

Answer 1 (score: 0)

You can't treat it as a date yet, because you first need to parse the string. Try t(sapply(strsplit(Sample_Data$var2," "),function(x) x[7:8])) to get the two columns you need.
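One caveat with fixed positions: x[7:8] assumes every description has exactly eight words, but the row for var1 = 204 in the question has only seven. A small sketch of a variant that takes the last two words instead, assuming the month and year always come last (the two-element var2 vector and the names by_index and by_tail are made up for illustration):

```r
# Hypothetical sample; the second string has one word fewer than the first
var2 <- c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
          "BROKERAGE FOR THE MONTH OF MARCH 2008")

words <- strsplit(var2, " ", fixed = TRUE)

# Fixed positions 7:8 pick the wrong elements for the shorter string
by_index <- t(sapply(words, function(x) x[7:8]))

# Taking the last two words does not depend on the number of words
by_tail <- t(sapply(words, function(x) tail(x, 2)))

by_index
by_tail
```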

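Answer 2 (score: 0)

You don't need lubridate, because you aren't actually working with a Date data type. Use strsplit from base R to split var2 into "words". It looks like the month is always the second-to-last word and the year is the last word.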

# reproducible example please!
d <- read.table(textConnection("
var1, var2
203, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
205, UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010
206, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
207, UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009
204, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
"), header=TRUE, sep=",", stringsAsFactors=FALSE)

get_month <- function(s) {
  words <- unlist(strsplit(s, " "))
  words[length(words)-1]
}
get_year <- function(s) {
  words <- unlist(strsplit(s, " "))
  as.integer(words[length(words)])
}

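# sapply returns plain vectors; lapply would make year a list column
d$month = sapply(d$var2, get_month, USE.NAMES=FALSE)

d$year = sapply(d$var2, get_year, USE.NAMES=FALSE)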

d

which produces the desired output:

> d
  var1                                           var2 month year
1  203  UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
2  205    UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010   MAY 2010
3  206  UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
4  207  UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
5  204  UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
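
All the answers rely on the month and year being the last two words, which is not true for rows 206 and 207 of the sample data in the question. As a rough sketch for that case, one could search each string for an upper-case month name and a four-digit year wherever they occur. This is not taken from any of the answers: first_match is a hypothetical helper, and the year pattern assumes years in the 1900s or 2000s.

```r
# Sample data as posted in the question
Sample_Data <- data.frame(
  var1 = c(203L, 205L, 206L, 207L, 204L),
  var2 = c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
           "UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010",
           "2008 MARCH MONTH BROKERAGE",
           "UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL",
           "BROKERAGE FOR THE MONTH OF MARCH 2008"),
  stringsAsFactors = FALSE
)

# Alternation of upper-case month names as whole words: \b(JANUARY|FEBRUARY|...)\b
month_pattern <- paste0("\\b(", paste(toupper(month.name), collapse = "|"), ")\\b")

# Hypothetical helper: first match of `pattern` in each string, NA if none
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

Sample_Data$month <- first_match(month_pattern, Sample_Data$var2)
# Assumes a four-digit year starting with 19 or 20
Sample_Data$year <- as.integer(first_match("\\b(19|20)\\d{2}\\b", Sample_Data$var2))

Sample_Data
```

The word boundaries keep "MONTH" from being mistaken for a month name, and rows with no match get NA rather than an error.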