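I am trying to use ddply with transform to populate a new variable (summary_Date) in a data frame that has the variables ID and Date. The value of the new variable is chosen with ifelse, based on the length of the piece being evaluated:
If there are fewer than five observations for an ID in a given month, I want summary_Date to be calculated by rounding the date to the nearest month (using round_date from the lubridate package); if there are five or more observations for an ID in a given month, I want summary_Date to simply be Date.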
require(plyr)
require(lubridate)
test.df <- structure(
list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1
, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2)
, Date = structure(c(-247320000, -246196800, -245073600, -243864000
, -242654400, -241444800, -126273600, -123595200
, -121176000, -118497600, 1359385200, 1359388800
, 1359392400, 1359396000, 1359399600, 1359403200
, 1359406800, 1359410400, 1359414000, 1359417600
, 55598400, 56116800, 58881600, 62078400, 64756800
, 67348800, 69854400, 72964800, 76161600, 79012800
, 1358589600, 1358676000, 1358762400, 1358848800
, 1358935200, 1359021600, 1359108000, 1359194400
, 1359280800, 1359367200), tzone = "GMT"
, class = c("POSIXct", "POSIXt"))
, Val=rnorm(40))
, .Names = c("ID", "Date", "Val"), row.names = c(NA, 40L)
, class = "data.frame")
test.df <- ddply(test.df, .(ID, floor_date(Date, "month")), transform
, summary_Date=as.POSIXct(ifelse(length(ID)<5
, round_date(Date, "month")
,Date)
, origin="1970-01-01 00:00.00"
, tz="GMT")
# Included length_x to easily see the length of the subset
, length_x = length(ID))
head(test.df,5)
# floor_date(Date, "month") ID Date Val summary_Date length_x
# 1 1962-03-01 1 1962-03-01 12:00:00 -0.1037988 1962-03-01 3
# 2 1962-03-01 1 1962-03-14 12:00:00 0.2923056 1962-03-01 3
# 3 1962-03-01 1 1962-03-27 12:00:00 0.4435410 1962-03-01 3
# 4 1962-04-01 1 1962-04-10 12:00:00 0.1159164 1962-04-01 2
# 5 1962-04-01 1 1962-04-24 12:00:00 2.9824075 1962-04-01 2
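The ifelse statement seems to work, but the values in summary_Date appear to be the first value calculated for the subset that transform is working on, rather than a row-specific value. For example, in row 3, summary_Date should be 1962-04-01, because the date 1962-03-27 12:00:00 should be rounded up (there are fewer than five rows in the subset). Instead, the first calculated value of summary_Date (1962-03-01) is repeated across all rows of that subset.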
EDIT: Inspired by Ricardo's answer using data.table, I tried doing it with ddply in two steps. It also works:
test.df <- ddply(test.df, .(ID, floor_date(Date, "month")), transform
, length_x = length(ID))
test.df <- ddply(test.df, .(ID, floor_date(Date, "month")), transform
, summary_Date=as.POSIXct(ifelse(length_x<5
, round_date(Date, "month")
,Date)
, origin="1970-01-01 00:00.00"
, tz="GMT"))
head(test.df,5)[c(1,3:7)]
# floor_date(Date, "month") ID Date Val length_x summary_Date
# 1 1962-03-01 1 1962-03-01 12:00:00 -0.1711212 3 1962-03-01
# 2 1962-03-01 1 1962-03-14 12:00:00 -0.1531571 3 1962-03-01
# 3 1962-03-01 1 1962-03-27 12:00:00 0.1256238 3 1962-04-01
# 4 1962-04-01 1 1962-04-10 12:00:00 1.4481225 2 1962-04-01
# 5 1962-04-01 1 1962-04-24 12:00:00 -0.6508731 2 1962-05-01
Answer 0 (score: 7)
A one-step ddply solution (also posted as a comment):
ddply(test.df, .(ID, floor_date(Date, "month")), mutate,
length_x = length(ID),
summary_Date=as.POSIXct(ifelse(length_x < 5, round_date(Date, "month") ,Date)
, origin="1970-01-01 00:00.00", tz="GMT")
)
Answer 1 (score: 1)
# transform to data.table
library(data.table)
test.dt <- data.table(test.df)
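# calculate the number of rows per ID and month-year
# (ID is included in the grouping because the question counts observations per ID)
test.dt[, idlen := length(ID), by=list(ID, month(Date), year(Date)) ]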
# calculate the summary date
test.dt[, summary_Date := ifelse(idlen<5, as.Date(round_date(Date, "month")), as.Date(Date))]
# If you would like to have it formatted add the following:
test.dt[, summary_Date := as.Date(summary_Date, origin="1970-01-01")]
> test.dt
ID Date Val idlen summary_Date
1: 1 1962-03-01 12:00:00 0.42646422 3 1962-03-01
2: 1 1962-03-14 12:00:00 -0.29507148 3 1962-03-01
3: 1 1962-03-27 12:00:00 0.89512566 3 1962-04-01 <~~~~~
4: 1 1962-04-10 12:00:00 0.87813349 2 1962-04-01
5: 1 1962-04-24 12:00:00 0.82158108 2 1962-05-01
6: 1 1962-05-08 12:00:00 0.68864025 1 1962-05-01
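The reason it cannot be done in one step is that the test length(ID) < 5 produces only a single value per group, so ifelse returns a single value as well. When that value is assigned to the members of the group, you are assigning 1 element to many. R knows how to handle this situation: it recycles the single element.
In this specific case, however, you do not want recycling; you do not want 1 element applied to many rows. You therefore need the group length stored as a column first, which is what we do in the two-step version. The test then has one entry per row, and each element (row) of the group is assigned its own specific value.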
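The recycling behaviour described above can be reproduced with base R alone. The following is a minimal sketch with made-up numbers standing in for the dates (dates, rounded and n are hypothetical names, not part of the question's data). It shows that ifelse returns a result as long as its test, so a length-1 test yields one value that is then recycled across the rows.

```r
# Hypothetical toy values standing in for Date and round_date(Date, "month")
dates   <- c(10, 20, 30)
rounded <- c(0, 0, 100)

# length(dates) < 5 is a single TRUE, so ifelse() returns a single value:
# the first element of the "yes" vector
one_value <- ifelse(length(dates) < 5, rounded, dates)

# Assigning that single value as a column recycles it over every row
recycled <- transform(data.frame(d = dates), s = one_value)

# With the group length stored once per row, the test has one entry per row
# and ifelse() picks a value row by row
n <- rep(length(dates), length(dates))
per_row <- ifelse(n < 5, rounded, dates)

print(one_value)
print(recycled)
print(per_row)
```

Here one_value has length one while per_row has the same length as dates, which is the difference between the one-step transform call and the two-step version.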
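@Ramnath made a great suggestion to use mutate. Taking a look at ?mutate gives:
This function is very similar to transform but it executes the transformations iteratively... later transformations can use the columns created by earlier transformations
Which is exactly what you want to do!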
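As a small illustration of that difference, here is a sketch on a toy data frame. The column names doubled and plus_one are made up for the example, and it assumes no object called doubled already exists in the calling environment.

```r
library(plyr)  # provides mutate(); transform() is base R

df <- data.frame(x = c(1, 2, 3))

# transform() evaluates all of its arguments against the original columns,
# so a column created in the same call is not visible yet and this fails
res_transform <- tryCatch(
  transform(df, doubled = x * 2, plus_one = doubled + 1),
  error = function(e) conditionMessage(e)
)

# mutate() evaluates its arguments one after another,
# so plus_one can refer to the freshly created doubled column
res_mutate <- mutate(df, doubled = x * 2, plus_one = doubled + 1)

print(res_transform)
print(res_mutate)
```

This is why the one-step mutate call can define length_x and use it in the very next argument, while transform needs two separate ddply calls.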