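I have a data.frame containing two date columns: one for a person's date of birth (DOB) and one for a reference point in time (Snapshot.Date), say the day we last saw that person. There are other columns (omitted), so ideally I'd like the result added as a column to the existing data.frame.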
I want to calculate the number of months (as a continuous value) between the last birthday (relative to Snapshot.Date) and Snapshot.Date.
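I've tried a plyr solution and a base R solution, and both are slower than I expected (and I need to process a million rows in my real data.frame).
First, here is a test dataset: 20 original records, including the special case of a February 29th birthday, which only exists in leap years.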
data.test = structure(list(Snapshot.Date = structure(c(1433030400, 1396224000,
1375228800, 1396224000, 1383177600, 1362009600, 1367280000, 1369958400,
1346371200, 1348963200, 1435622400, 1435622400, 1435622400, 1435622400,
1435622400, 1435622400, 1435622400, 1435622400, 1435622400, 1346371200
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), DOB = structure(c(-268790400,
-155692800, -955065600, -551232000, -149644800, -774230400, -485395200,
-17625600, -131932800, -387244800, 545961600, 18489600, -230515200,
441676800, -32745600, 775180800, 713491200, 483235200, 114307200,
-815443200), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("Snapshot.Date",
"DOB"), row.names = c(32806L, 21294L, 14880L, 21730L, 17525L,
8516L, 11068L, 11751L, 2564L, 3832L, 802276L, 1031697L, 129222L,
588224L, 1093247L, 878037L, 370736L, 709108L, 861908L, 2199L), class = "data.frame")
The function that calculates the months (I'm sure this can be improved too).
library(lubridate) # provides month(), day(), year(), years(), interval(), as.period()

months_since_last_birthday = function(CurrentDate, DateOfBirth)
{
  last_birthday = DateOfBirth
  if(month(last_birthday) == 2 && day(last_birthday) == 29) # this birthday only occurs once every four years, let's reset them to be the 28th
  {
    day(last_birthday) = 28
  }
  year(last_birthday) = year(CurrentDate)
  if(last_birthday > CurrentDate)
  {
    last_birthday = last_birthday - years(1) # last year's birthday is the most recent occurrence
  }
  # interval() is the current name for what older lubridate releases called new_interval()
  return(as.period(interval(last_birthday, CurrentDate)) / months(1))
}
For the 20 base records, this is the desired output:
        Snapshot.Date        DOB Months.Since.Birthday
32806      2015-05-31 1961-06-26            11.1643836
21294      2014-03-31 1965-01-25             2.1972603
14880      2013-07-31 1939-09-27            10.1315068
21730      2014-03-31 1952-07-14             8.5589041
17525      2013-10-31 1965-04-05             6.8547945
8516       2013-02-28 1945-06-20             8.2630137
11068      2013-04-30 1954-08-15             8.4931507
11751      2013-05-31 1969-06-11            11.6575342
2564       2012-08-31 1965-10-27            10.1315068
3832       2012-09-30 1957-09-24             0.1972603
802276     2015-06-30 1987-04-21             2.2958904
1031697    2015-06-30 1970-08-03            10.8876712
129222     2015-06-30 1962-09-12             9.5917808
588224     2015-06-30 1983-12-31             5.9863014
1093247    2015-06-30 1968-12-18             6.3945205
878037     2015-06-30 1994-07-26            11.1315068
370736     2015-06-30 1992-08-11            10.6246575
709108     2015-06-30 1985-04-25             2.1643836
861908     2015-06-30 1973-08-16            10.4602740
2199       2012-08-31 1944-02-29             6.0986301
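To make the target number concrete, here is a small base R check of the first row (32806). It is only a sketch: it assumes the fractional part is computed by treating a month as 365/12 days, which is inferred from the figures in the table and not stated anywhere in the question. The variable names are made up for illustration.

```r
# Worked check of row 32806, using base R only.
snapshot      <- as.Date("2015-05-31")
last_birthday <- as.Date("2014-06-26")  # most recent 26 June on or before the snapshot

# 11 whole months after the last birthday is 2015-05-26
whole_months <- 11
after_months <- seq(last_birthday, by = "11 months", length.out = 2)[2]

# Days left over between that date and the snapshot
leftover_days <- as.numeric(snapshot - after_months)

# Assumption: a month counts as 365/12 days for the fractional part
months_since <- whole_months + leftover_days / (365 / 12)
print(months_since, digits = 9)  # compare with 11.1643836 in the table
```

The same arithmetic reproduces the February 29th row (2199): six whole months from 2012-02-28 plus 3 leftover days.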
Expanding the dataset for benchmarking:
# Make 5000 records total for benchmarking, didn't replicate Feb 29th
# since it is a very rare case in the data
set.seed(1)
data.test = rbind(data.test, data.test[sample(1:19, size = 4980, replace = TRUE),])
library(plyr) # provides adply()

# plyr approach: apply the function row by row
start.time = Sys.time()
res = suppressMessages(adply(data.test, 1, transform, Months.Since.Birthday = months_since_last_birthday(Snapshot.Date, DOB)))
end.time = Sys.time()
# end.time - start.time
# Time difference of 1.793945 mins

# base approach: sapply over the row indices
start.time = Sys.time()
data.test$Months.Since.Birthday = suppressMessages(sapply(1:5000, function(x){return(months_since_last_birthday(data.test$Snapshot.Date[x], data.test$DOB[x]))}))
end.time = Sys.time()
# end.time - start.time
# Time difference of 1.743053 mins
Am I doing something wrong? Does this seem really slow to you? Any feedback is welcome!
Answer 0 (score: 1)
Unless I'm missing something obvious, R has many built-in ways of handling time data, notably base::difftime, which could save you some trouble.
Taking your dataset data.test from above:
data.test$dif <- round(as.vector(as.double(difftime(strptime(data.test$Snapshot.Date, format = "%Y-%m-%d"), strptime(data.test$DOB, format = "%Y-%m-%d"), units = "days"))) %% 365, 1)
Or laid out more logically (this will not work if you copy and paste it):
data.test$dif <-
  round(
    as.vector(
      as.double(
        difftime(
          strptime(data.test$Snapshot.Date, format = "%Y-%m-%d"),
          strptime(data.test$DOB, format = "%Y-%m-%d"), units = "days")
      )
    )
    %% 365,
  1)
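The above uses the difftime function to find the difference, in days, between two dates in the given format (format = "%Y-%m-%d"), then takes the remainder after dividing by 365 to get the number of days since the last birthday. Personally I think this is a better measure than months, since a 2-month span covering July and August does not contain the same number of days as a 2-month span covering January and February.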
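As a rough illustration of what the remainder gives for a single record, the sketch below applies the same difftime and %% 365 steps to one pair of dates from the question and compares the result with a direct count from the actual last birthday. The variable names are invented for this example; only difftime and the dates come from the thread.

```r
# One record from the question: snapshot 2015-05-31, born 1961-06-26.
snapshot <- as.Date("2015-05-31")
dob      <- as.Date("1961-06-26")

# Total elapsed days between the two dates
total_days <- as.numeric(difftime(snapshot, dob, units = "days"))

# Remainder after removing whole 365-day "years"
days_mod_365 <- total_days %% 365

# Direct count from the true last birthday, for comparison
days_exact <- as.numeric(snapshot - as.Date("2014-06-26"))

# The gap between the two is the number of leap days the person has lived through
leap_day_drift <- days_mod_365 - days_exact
```

For this record the remainder overshoots the direct count by one day per leap year between the birth date and the last birthday, so the drift grows with the person's age.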
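Note: the above solution does not account for leap years. You can easily look up a list of leap years and add 1 day in your check, or subtract 1 day from the birthday of everyone who has lived through a leap year, to get an accurate number.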
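One possible way to sidestep the leap-year correction altogether is to build the last birthday as a real calendar date and subtract it from the snapshot. The following is a minimal vectorised sketch in base R, not the method from this answer: days_since_last_birthday is a hypothetical helper name, and moving February 29th birthdays to the 28th is an assumption borrowed from the question's own function.

```r
# Hypothetical helper: leap-year-aware days since the most recent birthday.
# Works on whole vectors, so no per-row loop is needed.
days_since_last_birthday <- function(snapshot, dob) {
  snap <- as.POSIXlt(snapshot, tz = "UTC")
  born <- as.POSIXlt(dob, tz = "UTC")

  # Assumption: treat 29 February birthdays as 28 February, as the question does
  feb29   <- born$mon == 1L & born$mday == 29L
  bday_md <- ifelse(feb29, "02-28", format(born, "%m-%d"))

  # Birthday in the snapshot's calendar year
  snap_year <- snap$year + 1900L
  this_year <- as.Date(paste(snap_year, bday_md, sep = "-"))

  # If that birthday is still in the future, fall back one year
  not_yet <- this_year > as.Date(snap)
  last    <- as.Date(paste(snap_year - not_yet, bday_md, sep = "-"))

  as.numeric(as.Date(snap) - last)
}

# Three records taken from the question's test data
snapshots <- as.POSIXct(c("2015-05-31", "2012-08-31", "2014-03-31"), tz = "UTC")
dobs      <- as.POSIXct(c("1961-06-26", "1944-02-29", "1965-01-25"), tz = "UTC")
result    <- days_since_last_birthday(snapshots, dobs)
```

Applied to the question's data, this could be assigned as a new column, for example data.test$Days.Since.Birthday <- days_since_last_birthday(data.test$Snapshot.Date, data.test$DOB). Because every step operates on full vectors, it should scale to large inputs far better than a row-by-row loop, though it has not been benchmarked here.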