Question

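I realize there are a lot of questions on this topic, but I have not been able to solve my problem by looking through the various answers. I have a df, an excerpt of which is reproduced below: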

ID = as.factor(c("1","1","1","1","1",
                 "2","2","2",
                 "3","3","3","3",
                 "4","4","4","4","4"))
AdDate = c("2010-03-04", "2010-04-05", "2011-01-23", "2011-03-20", "2012-07-08",
           "2010-12-02", "2011-05-17", "2011-09-11",
           "2010-04-11", "2010-05-15", "2011-02-22", "2011-09-23",
           "2009-10-04", "2010-02-15", "2010-08-17", "2011-06-20", "2012-04-08")
OpofInterest = c("FALSE", "FALSE", "TRUE", "FALSE", "FALSE",
                 "FALSE", "TRUE", "FALSE",
                 "FALSE", "FALSE", "TRUE", "FALSE",
                 "FALSE", "FALSE", "TRUE", "FALSE", "FALSE")
df = data.frame(ID, AdDate, OpofInterest)

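What I was trying to do was split the df by ID into multiple data frames (4 in this example) and then apply the function below to mark whether each of the other episodes (each row) came before (Pre-Op), on the same date as (Per-Op), or after (Post-Op) the operation of interest for each person (ID), based on AdDate. I am new to R and to programming, and I wrote the function below. In reality I have thousands of IDs and episodes and about 80 columns, so I cannot subset each ID by hand and apply the function, which I did get working on a single subset after some tweaking.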

prepostassignment <- function (df) {

df_OpofInterest = subset(df,(df["OpofInterest"] == "TRUE"))  

for (i in 1:nrow(df)) {

if (df$AdDate[i] < df_OpofInterest$AdDate) {
    df$Pre_Post_Assignment[i] = "Pre"

} else if (df$AdDate[i] == df_OpofInterest$AdDate) {
  df$Pre_Post_Assignment[i] = "Per"

} else if (df$AdDate[i] > df_OpofInterest$AdDate) { 
  df$Pre_Post_Assignment[i] = "Post"

  }
 }
}

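I have played around with tapply, aggregate, and ddply, and can't seem to come up with a solution. When using the function on a manual subset, I also get the following error message: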

missing value where TRUE/FALSE needed

I have also read this post, but could not work out what is going wrong in my particular code.

The final result I want is as follows:

ID = as.factor(c("1","1","1","1","1",
                 "2","2","2",
                 "3","3","3","3",
                 "4","4","4","4","4"))
AdDate = c("2010-03-04", "2010-04-05", "2011-01-23", "2011-03-20", "2012-07-08",
           "2010-12-02", "2011-05-17", "2011-09-11",
           "2010-04-11", "2010-05-15", "2011-02-22", "2011-09-23",
           "2009-10-04", "2010-02-15", "2010-08-17", "2011-06-20", "2012-04-08")
OpofInterest = c("FALSE", "FALSE", "TRUE", "FALSE", "FALSE",
                 "FALSE", "TRUE", "FALSE",
                 "FALSE", "FALSE", "TRUE", "FALSE",
                 "FALSE", "FALSE", "TRUE", "FALSE", "FALSE")
Pre_Post_Assignment = c("Pre", "Pre", "Per", "Post", "Post",
                        "Pre", "Per", "Post",
                        "Pre", "Pre", "Per", "Post",
                        "Pre", "Pre", "Per", "Post", "Post")
df_new = data.frame(ID, AdDate, OpofInterest, Pre_Post_Assignment)

Any help is greatly appreciated.

Thanks.

Answer 1

This is a classic split-apply-combine analysis. Here is an option using data.table:

df = data.frame(ID, AdDate, OpofInterest, stringsAsFactors=FALSE)
df$OpofInterest <- as.logical(df$OpofInterest)
library(data.table)
dt <- data.table(df)
dt[, 
  cbind(
    .SD,
    Pre_Post_Assignment=
      ifelse(
         AdDate < AdDate[OpofInterest], 
         "Pre",
         ifelse(AdDate == AdDate[OpofInterest], "Per", "Post"
    ) ) ), 
  by=ID]
#     ID     AdDate OpofInterest Pre_Post_Assignment
#  1:  1 2010-03-04        FALSE                 Pre
#  2:  1 2010-04-05        FALSE                 Pre
#  3:  1 2011-01-23         TRUE                 Per
#  4:  1 2011-03-20        FALSE                Post
#  5:  1 2012-07-08        FALSE                Post
#  6:  2 2010-12-02        FALSE                 Pre
#  7:  2 2011-05-17         TRUE                 Per
#  8:  2 2011-09-11        FALSE                Post
#  9:  3 2010-04-11        FALSE                 Pre
# 10:  3 2010-05-15        FALSE                 Pre
# 11:  3 2011-02-22         TRUE                 Per
# 12:  3 2011-09-23        FALSE                Post
# 13:  4 2009-10-04        FALSE                 Pre
# 14:  4 2010-02-15        FALSE                 Pre
# 15:  4 2010-08-17         TRUE                 Per
# 16:  4 2011-06-20        FALSE                Post
# 17:  4 2012-04-08        FALSE                Post

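You could also use ddply. The actual computation is the two nested ifelse statements. The second argument to [.data.table is the list of columns we want in the output in addition to the split/grouping column (here ID). The .SD variable is a special data.table variable that contains, for each group, all the columns not referenced in the by argument (here it contains AdDate and OpofInterest). We cbind our extra vector to .SD to create a new result with the extra column.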
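The ddply route mentioned above could look roughly like the following. This is a minimal sketch, not the answerer's code: it assumes the plyr package is installed, uses a shortened copy of the example data, and assumes each ID has exactly one row with OpofInterest equal to TRUE.

```r
library(plyr)

# Shortened copy of the example data (two IDs only), with dates kept as
# character strings and OpofInterest already stored as logical.
df <- data.frame(
  ID = factor(c("1", "1", "1", "2", "2", "2")),
  AdDate = c("2010-03-04", "2011-01-23", "2011-03-20",
             "2010-12-02", "2011-05-17", "2011-09-11"),
  OpofInterest = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# ddply splits df by ID, calls the function on each piece, and row-binds
# the returned data frames back together.
res <- ddply(df, "ID", function(d) {
  # Assumption: exactly one operation of interest per ID.
  op_date <- d$AdDate[d$OpofInterest]
  d$Pre_Post_Assignment <- ifelse(
    d$AdDate < op_date, "Pre",
    ifelse(d$AdDate == op_date, "Per", "Post")
  )
  d
})
```

If an ID had zero or several operations of interest, op_date would not have length one and the comparison would need extra handling.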

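A few other points worth noting:

I kept the dates as character strings (stringsAsFactors=FALSE) so that the comparisons work
I converted OpofInterest to logical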
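A small sketch of why the column types matter. In R versions before 4.0, data.frame() turned strings into factors by default, and comparing unordered factors with < gives NA, which would plausibly produce the "missing value where TRUE/FALSE needed" error inside an if statement. The variable names below are illustrative only.

```r
a <- "2010-03-04"
b <- "2011-01-23"

# Unordered factors: `<` is not meaningful, so R returns NA with a warning.
f_cmp <- suppressWarnings(factor(a) < factor(b))

# Character strings in ISO year-month-day format sort in date order,
# so a plain string comparison gives the right answer here.
chr_cmp <- a < b

# Converting to Date is the more robust option and also works for
# formats that do not sort correctly as text.
date_cmp <- as.Date(a) < as.Date(b)

# Counter-example: day/month/year strings compare character by character,
# so the 2010 date is wrongly reported as earlier than the 2009 date.
bad_chr  <- "04/03/2010" < "23/01/2009"
good_cmp <- as.Date("04/03/2010", format = "%d/%m/%Y") <
            as.Date("23/01/2009", format = "%d/%m/%Y")
```

Here f_cmp is NA, chr_cmp and date_cmp are TRUE, bad_chr is TRUE, and good_cmp is FALSE.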

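Finally, a disclaimer: although the type of analysis performed here is split-apply-combine, the behind-the-scenes implementation in data.table does not split and then apply; instead it subsets and iterates (I'm noting this here so that Arun doesn't get mad at me).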
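For comparison, a literal split-apply-combine can be written in base R with split, lapply, and rbind. This is only an illustrative sketch under the same assumption of one operation of interest per ID; label_episodes is a hypothetical helper name, and the data is a shortened copy of the example.

```r
df <- data.frame(
  ID = factor(c("3", "3", "3", "3", "4", "4", "4")),
  AdDate = c("2010-04-11", "2010-05-15", "2011-02-22", "2011-09-23",
             "2009-10-04", "2010-08-17", "2011-06-20"),
  OpofInterest = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: labels every row of one ID's data frame relative
# to that ID's operation of interest, using vectorised comparisons
# instead of a row-by-row loop.
label_episodes <- function(d) {
  op_date <- d$AdDate[d$OpofInterest]
  stopifnot(length(op_date) == 1)  # assumption: one operation per ID
  d$Pre_Post_Assignment <- ifelse(
    d$AdDate < op_date, "Pre",
    ifelse(d$AdDate == op_date, "Per", "Post")
  )
  d  # return the modified piece so it can be recombined
}

pieces   <- split(df, df$ID)               # split: one data frame per ID
labelled <- lapply(pieces, label_episodes) # apply: label each piece
res      <- do.call(rbind, labelled)       # combine: stack the pieces
rownames(res) <- NULL
```

Note that the helper returns the modified data frame explicitly; a function that only assigns to a column inside a loop and returns nothing would leave the caller without the new column.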

data.table

I think it is cleaner, and also faster.

Split df by factor, apply a function, and return the combined df in R
