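I am using data.table for the first time.
My table has a column of about 400,000 dates of birth. I need to convert them to ages.
What is the best way to do this?
Answer 0 (score: 20)
From the comments on this blog entry, I found the age_calc function in the eeptools package. It handles edge cases (leap years, etc.), checks its inputs, and looks quite robust.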
library(eeptools)
x <- as.Date(c("2011-01-01", "1996-02-29"))
# enddate defaults to Sys.Date(), so the values below depend on the day the code was run
age_calc(x) # default is age in months
[1] 46.73333 224.83118
age_calc(x, units = "years") # but you can set it to years
[1] 3.893151 18.731507
floor(age_calc(x, units = "years"))
[1] 3 18
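For your data:
yourdata$age <- floor(age_calc(yourdata$birthdate, units = "years"))
assuming you want age in whole years.
Answer 1 (score: 19)
I have been thinking about this and have not been satisfied with the two answers so far. I like using lubridate, as @KFB did, but I also want things wrapped up nicely in a function, as in my answer using the eeptools package. So here is a wrapper function using the lubridate interval method, with some nice options:
library(lubridate)

#' Calculate age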
#'
#' By default, calculates the typical "age in years", with a
#' \code{floor} applied so that you are, e.g., 5 years old from
#' 5th birthday through the day before your 6th birthday. Set
#' \code{floor = FALSE} to return decimal ages, and change \code{units}
#' for units other than years.
#' @param dob date-of-birth, the day to start calculating age.
#' @param age.day the date on which age is to be calculated.
#' @param units unit to measure age in. Defaults to \code{"years"}. Passed to \code{\link{duration}}.
#' @param floor boolean for whether or not to floor the result. Defaults to \code{TRUE}.
#' @return Age in \code{units}. Will be an integer if \code{floor = TRUE}.
#' @examples
#' my.dob <- as.Date('1983-10-20')
#' age(my.dob)
#' age(my.dob, units = "minutes")
#' age(my.dob, floor = FALSE)
age <- function(dob, age.day = today(), units = "years", floor = TRUE) {
  calc.age = interval(dob, age.day) / duration(num = 1, units = units)
  if (floor) return(as.integer(floor(calc.age)))
  return(calc.age)
}
Usage example:
> my.dob <- as.Date('1983-10-20')
> age(my.dob)
[1] 31
> age(my.dob, floor = FALSE)
[1] 31.15616
> age(my.dob, units = "minutes")
[1] 16375680
> age(seq(my.dob, length.out = 6, by = "years"))
[1] 31 30 29 28 27 26
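Since the question is about a data.table column, the wrapper above can be applied to a whole column by reference. This is a minimal sketch, assuming a hypothetical table `dt` with a `dob` column and a fixed reference date so the result is reproducible. The function is repeated so the snippet runs on its own.

```r
library(data.table)
library(lubridate)

# Same wrapper as above, repeated so this snippet is self-contained.
age <- function(dob, age.day = today(), units = "years", floor = TRUE) {
  calc.age <- interval(dob, age.day) / duration(num = 1, units = units)
  if (floor) return(as.integer(floor(calc.age)))
  calc.age
}

# Hypothetical toy data; a real table would hold the ~400,000 birth dates.
dt <- data.table(dob = as.Date(c("1983-10-20", "2000-02-29")))

# := adds the column by reference, so the table is not copied.
dt[, age_years := age(dob, age.day = as.Date("2014-12-16"))]
dt
```

Because the function is vectorised over `dob`, no loop or `by` grouping should be needed.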
Answer 2 (score: 3)
Assuming you have a data.table, you can do the following:
library(data.table)
library(lubridate)
# toy data
X = data.table(birth=seq(from=as.Date("1970-01-01"), to=as.Date("1980-12-31"), by="year"))
Sys.Date()
Option 1: use "as.period" from the lubridate package
X[, age := as.period(Sys.Date() - birth)][]
birth age
1: 1970-01-01 44y 0m 327d 0H 0M 0S
2: 1971-01-01 43y 0m 327d 6H 0M 0S
3: 1972-01-01 42y 0m 327d 12H 0M 0S
4: 1973-01-01 41y 0m 326d 18H 0M 0S
5: 1974-01-01 40y 0m 327d 0H 0M 0S
6: 1975-01-01 39y 0m 327d 6H 0M 0S
7: 1976-01-01 38y 0m 327d 12H 0M 0S
8: 1977-01-01 37y 0m 326d 18H 0M 0S
9: 1978-01-01 36y 0m 327d 0H 0M 0S
10: 1979-01-01 35y 0m 327d 6H 0M 0S
11: 1980-01-01 34y 0m 327d 12H 0M 0S
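Option 2: if you don't like the format of Option 1, you can do the following:
yr = duration(num = 1, units = "years")
# interval() replaces new_interval(), which is no longer available in current lubridate
X[, age := interval(birth, Sys.Date())/yr][]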
# you get
birth age
1: 1970-01-01 44.92603
2: 1971-01-01 43.92603
3: 1972-01-01 42.92603
4: 1973-01-01 41.92329
5: 1974-01-01 40.92329
6: 1975-01-01 39.92329
7: 1976-01-01 38.92329
8: 1977-01-01 37.92055
9: 1978-01-01 36.92055
10: 1979-01-01 35.92055
11: 1980-01-01 34.92055
I believe Option 2 is the more desirable one.
Answer 3 (score: 1)
# as.numeric() drops the "days" label that the difftime would otherwise keep
as.numeric(Sys.Date() - yourDate) / 365.25
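Dividing by 365.25 is only an approximation, and it can be off by one on or near a birthday. Below is a small base R sketch of the problem, next to a calendar-based comparison. The helper name `whole_years` is made up for this illustration.

```r
dob <- as.Date("2000-03-01")
on  <- as.Date("2018-03-01")  # the 18th birthday

# Approximation: elapsed days divided by the average year length.
approx_age <- as.numeric(on - dob) / 365.25
floor(approx_age)  # 17, one year short on the birthday itself

# Calendar-based: compare month and day instead of counting days.
whole_years <- function(dob, on) {
  d <- as.POSIXlt(dob)
  o <- as.POSIXlt(on)
  had_birthday <- (o$mon > d$mon) | (o$mon == d$mon & o$mday >= d$mday)
  (o$year - d$year) - as.integer(!had_birthday)
}
whole_years(dob, on)  # 18
```

The day count between these two dates is 6574, while 18 average years would be 6574.5 days, which is why the approximation falls just short.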
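Answer 4 (score: 0)
I was not satisfied with any of the answers when it comes to calculating age in months or years while accounting for leap years, so here is my function using the lubridate package.
Basically, it slices the interval between from and to into (at most) yearly chunks, and then adjusts each chunk's interval according to whether or not that chunk falls in a leap year. The total interval is the sum of the ages of the chunks.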
library(lubridate)
#' Get Age of Date relative to Another Date
#'
#' @param from,to the date or dates to consider
#' @param units the units to consider
#' @param floor logical as to whether to floor the result
#' @param simple logical as to whether to do a simple calculation, a simple calculation doesn't account for leap year.
#' @author Nicholas Hamilton
#' @export
age <- function(from, to = today(), units = "years", floor = FALSE, simple = FALSE) {
  #Account for Leap Year if Working in Months and Years
  if(!simple && length(grep("^(month|year)",units)) > 0){
    df = data.frame(from,to)
    calc = sapply(1:nrow(df),function(r){
      #Start and Finish Points
      st = df[r,1]; fn = df[r,2]
      #If there is no difference, age is zero
      if(st == fn){ return(0) }
      #If there is a difference, age is not zero and needs to be calculated
      sign = +1 #Age Direction
      if(st > fn){ tmp = st; st = fn; fn = tmp; sign = -1 } #Swap and Change sign
      #Determine the slice-points
      mid = ceiling_date(seq(st,fn,by='year'),'year')
      #Build the sequence
      dates = unique( c(st,mid,fn) )
      dates = dates[which(dates >= st & dates <= fn)]
      #Determine the age of the chunks
      chunks = sapply(head(seq_along(dates),-1),function(ix){
        k = 365/( 365 + leap_year(dates[ix]) )
        k*interval( dates[ix], dates[ix+1] ) / duration(num = 1, units = units)
      })
      #Sum the Chunks, and account for direction
      sign*sum(chunks)
    })
  #If Simple Calculation or Not Months or Not years
  }else{
    calc = interval(from,to) / duration(num = 1, units = units)
  }
  if (floor) calc = as.integer(floor(calc))
  calc
}
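For comparison, lubridate itself provides time_length(), which is documented to take varying month and year lengths into account when given an interval and units of months or years. This is a minimal sketch, assuming a reasonably recent lubridate version. It is offered as a cross-check, not as a replacement for the function above.

```r
library(lubridate)

dob <- ymd("2000-03-01")

# Age in years on the 18th birthday and on the day before it.
on_birthday <- time_length(interval(dob, ymd("2018-03-01")), "years")
day_before  <- time_length(interval(dob, ymd("2018-02-28")), "years")

floor(on_birthday)  # 18
floor(day_before)   # 17

# Vectorised over birth dates, so it can be applied directly to a column.
dobs <- ymd(c("1983-10-20", "1996-02-29"))
ages <- floor(time_length(interval(dobs, ymd("2014-12-16")), "years"))
ages
```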
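Answer 5 (score: 0)
I prefer to use the lubridate package, borrowing syntax I originally came across in another post.
Your input dates need to be standardized as R date objects first, preferably with lubridate::mdy(), lubridate::ymd(), or a similar function as applicable. You can then use the interval() function to generate an interval describing the time elapsed between the two dates, and the duration() function to define how that interval should be "diced".
Below I have summarized the simplest case of calculating an age from two dates, using the most current syntax in R.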
df$DOB <- mdy(df$DOB)
df$EndDate <- mdy(df$EndDate)
df$Calc_Age <- interval(start = df$DOB, end = df$EndDate) /
  duration(num = 1, units = "years")
The age can be rounded down to the nearest whole integer using the base R floor() function, as follows:
df$Calc_AgeF <- floor(df$Calc_Age)
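Alternatively, the digits= argument of the base R round() function can be used to round up or down and to specify the exact number of decimal places in the returned value, as follows: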
df$Calc_Age2 <- round(df$Calc_Age, digits = 2) ## 2 decimals
df$Calc_Age0 <- round(df$Calc_Age, digits = 0) ## nearest integer
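It is worth noting that once the input dates have passed through the calculation steps above (i.e., the interval() and duration() functions), the returned value is numeric and no longer a date object in R. This matters because lubridate::floor_date() is strictly limited to date-time objects.
The above syntax works regardless of whether the input dates are in a data.table or a data.frame object.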
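As an illustration of the data.table case, here is a hedged sketch of the same steps applied by reference. The table `dt` and its values are hypothetical; the column names follow the data.frame example above.

```r
library(data.table)
library(lubridate)

# Hypothetical input with dates stored as month/day/year strings.
dt <- data.table(DOB     = c("01/15/1990", "07/04/1976"),
                 EndDate = c("06/30/2020", "01/01/2000"))

# Parse both columns into Date objects in place.
date_cols <- c("DOB", "EndDate")
dt[, (date_cols) := lapply(.SD, mdy), .SDcols = date_cols]

# Decimal age, then whole years via base R floor().
dt[, Calc_Age  := interval(DOB, EndDate) / duration(num = 1, units = "years")]
dt[, Calc_AgeF := floor(Calc_Age)]
dt
```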
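Answer 6 (score: 0)
I wanted an implementation that did not increase my dependencies beyond data.table, which is usually my only dependency. data.table is only needed for mday, which means day of the month.
Here is a function that works the way my brain does when I think about someone's age: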
require(data.table)
agecalc <- function(origin, current){
  y <- year(current) - year(origin) - 1
  offset <- 0
  if(month(current) > month(origin)) offset <- 1
  # && rather than &: if() needs a single TRUE/FALSE
  if(month(current) == month(origin) &&
     mday(current) >= mday(origin)) offset <- 1
  age <- y + offset
  return(age)
}
Here is the same logic, refactored and vectorized:
agecalc <- function(origin, current){
  age <- year(current) - year(origin) - 1
  ii <- (month(current) > month(origin)) | (month(current) == month(origin) &
                                            mday(current) >= mday(origin))
  age[ii] <- age[ii] + 1
  return(age)
}
You could also do a string comparison on the mm-dd part. I can imagine scenarios where string comparison might be faster, for example if you have the year as a number and the birthday as a string.
agecalc <- function(origin, current){
  origin <- as.character(origin)
  current <- as.character(current)
  age <- as.numeric(substr(current, 1, 4)) - as.numeric(substr(origin, 1, 4)) - 1
  if(substr(current, 6, 10) >= substr(origin, 6, 10)){
    age <- age + 1
  }
  return(age)
}
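The string version above uses if(), which only handles a single pair of dates. Here is a possible vectorised variant in base R, assuming both inputs are Dates or ISO yyyy-mm-dd strings. The name `agecalc_str` is made up for this sketch.

```r
agecalc_str <- function(origin, current) {
  origin  <- as.character(origin)
  current <- as.character(current)
  age <- as.numeric(substr(current, 1, 4)) - as.numeric(substr(origin, 1, 4)) - 1
  # TRUE where the mm-dd part of `current` is on or after the birthday's mm-dd.
  had_birthday <- substr(current, 6, 10) >= substr(origin, 6, 10)
  # Adding a logical vector adds 1 where TRUE and 0 where FALSE.
  age + had_birthday
}

res <- agecalc_str("1985-08-13", c("1986-08-12", "1986-08-13", "1986-09-12"))
res  # 0 1 1
```

The comparison relies on the fixed-width, zero-padded mm-dd format, so it would not be safe for other date layouts.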
Some tests:
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-09-12"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-03-01"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2011-03-01"))
## Requires vectorized version:
d <- data.table(d=as.IDate("2000-01-01") + 0:10000)
d[ , b1 := as.IDate("2000-08-15")]
d[ , b2 := as.IDate("2000-02-29")]
d[ , age1_num := (d - b1) / 365]
d[ , age2_num := (d - b2) / 365]
d[ , age1 := agecalc(b1, d)]
d[ , age2 := agecalc(b2, d)]
d
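Answer 7 (score: 0)
A very simple way to calculate the age between two dates without any additional packages might be the following. Note that the result is a difftime in days, not years: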
df$age = with(df, as.Date(date_2, "%Y-%m-%d") - as.Date(date_1, "%Y-%m-%d"))