I want to extract years from text.
The following code gives me a vector with the values 1998 and 2009:
library(stringr)
description <- "I was teaching at the university from 1998 to 2009"
teaching <- as.numeric(str_extract_all(description, "\\d{4}")[[1]])
Then I want to subtract the years:
teaching[2] - teaching[1]
[1] 11
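The problem is that I have a data frame with a column containing texts like this, and I want to extract the years from each text and subtract them.
I tried the following, but got confused: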
aa = lapply(df$description, str_extract_all, "\\d{4}")
bb = lapply(aa, function(x) x[1])
Answer 0 (score: 3)
You can try the following:
library(stringr)
# example data
df <- data.frame(description = paste("I was teaching at the university from", 1990:1995, "to", seq(2010, 2020, by = 2)))
# description
#1 I was teaching at the university from 1990 to 2010
#2 I was teaching at the university from 1991 to 2012
#3 I was teaching at the university from 1992 to 2014
#4 I was teaching at the university from 1993 to 2016
#5 I was teaching at the university from 1994 to 2018
#6 I was teaching at the university from 1995 to 2020
# str_extract_all() returns a list with one character vector of matches per row
years <- str_extract_all(df$description, "\\d{4}")
# diff() gives the second year minus the first for each row
sapply(years, function(x) diff(as.numeric(x)))
# 20 21 22 23 24 25
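Since the question asks about a data frame column, the result can be stored next to the text it came from. The following is a minimal sketch under the same assumptions as the example data; the column name `duration` is made up for illustration and is not part of the original answer.

```r
library(stringr)

# Example data in the same shape as the answer's data frame
df <- data.frame(
  description = paste("I was teaching at the university from", 1990:1995,
                      "to", seq(2010, 2020, by = 2))
)

# One character vector of four-digit matches per row
years <- str_extract_all(df$description, "\\d{4}")

# 'duration' is a hypothetical column name: end year minus start year
df$duration <- sapply(years, function(x) diff(as.numeric(x)))

df$duration
# 20 21 22 23 24 25
```

This relies on every row containing exactly two years. If a row has none, `sapply()` can no longer simplify to a vector and returns a list instead.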
An alternative that handles rows without any years by returning NA:
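library(stringr)
# example data, with one extra row that contains no years
df <- data.frame(description = c(paste("I was teaching at the university from", 1990:1995, "to", seq(2010, 2020, by = 2)), "I was not teaching at all"))
# simplify = TRUE returns a character matrix; rows without matches hold empty strings
years <- str_extract_all(df$description, "\\d{4}", simplify = TRUE)
# as.numeric("") is NA, so the row without years yields NA
apply(years, 1, function(x) diff(as.numeric(x)))
# 20 21 22 23 24 25 NA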
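The matrix approach assumes each text contains at most two years. If some rows might contain a single year, or more than two, one option is a small helper that always returns one number per text. This is only a sketch: `year_span` is a hypothetical name, and taking the last year minus the first is an assumption about what should happen when more than two years appear.

```r
library(stringr)

# Hypothetical helper: last four-digit number minus the first,
# or NA when fewer than two are found in a text
year_span <- function(text) {
  years <- str_extract_all(text, "\\d{4}")
  vapply(years, function(x) {
    x <- as.numeric(x)
    if (length(x) < 2) NA_real_ else x[length(x)] - x[1]
  }, numeric(1))
}

year_span(c("from 1998 to 2009", "since 2001", "I was not teaching at all"))
# 11 NA NA
```

`vapply()` with `numeric(1)` guarantees a plain numeric vector, so the result can be assigned directly to a data frame column. Note that `"\\d{4}"` also matches the first four digits of a longer number, so texts containing other long numbers may need a stricter pattern.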