R: compute the delta between two data frame cells and add it to each row

Time: 2016-06-11 11:01:24

Tags: r loops dataframe

I have a data frame that looks like this:

   df <- data.frame(
    "Subject" = c("A", "B", "C", "A", "B", "C"),
    "Stage" = c( "pre", "pre", "pre", "post", "post", "post"), 
    "Value" = c(.5, .2, .7, 1, .8, .3)
)

I want to find the delta between the pre and post values for each subject, for example:

delta <- df[df$Stage == 'post' & df$Subject == 'A', ]$Value - df[df$Stage == 'pre' & df$Subject == 'A', ]$Value

and then add it to the data frame in a column of its own:

df[["delta"]] <- NA
df[df$Subject == 'A', ][["delta"]] <- delta

The above works, but when I try to use this approach for all rows:

subjects <- unique(df$Subject)
df[["delta"]] <- NA
calc_deltas <- function(position) {
   subject <- subjects[position]
   delta <- df[df$Stage == 'post' & df$Subject == subject, ]$Value - df[df$Stage == 'pre' & df$Subject == subject, ]$Value
   df[df$Subject == subject, ][["delta"]] <- delta
}
lapply(1:length(subjects), calc_deltas)

I want the result to look like this:

  Subject Stage Value delta
1       A   pre   0.5   0.5
2       B   pre   0.2   0.6
3       C   pre   0.7  -0.4
4       A  post   1.0   0.5
5       B  post   0.8   0.6
6       C  post   0.3  -0.4

But instead it comes out like this:

  Subject Stage Value delta
1       A   pre   0.5   NA
2       B   pre   0.2   NA
3       C   pre   0.7   NA
4       A  post   1.0   NA
5       B  post   0.8   NA
6       C  post   0.3   NA

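The rows keep their NA values.

My actual problem is more complicated: I want to do this for several groups of observations in each row, and I want to add the columns dynamically (the script should handle datasets that are basically the same but contain different groups). Here is the function from my script: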

find_deltas <- function(pre_post_survey_data, question_groups) {
  output <- pre_post_survey_data

  users <- unique(output$username)

  calc_deltas <- function (position, name) {
    user <- users[position]
    calc_col_name <- paste0(name, "_mean")
    save_col_name <- paste0(name, "_delta")

    output[output$username == user, ][[save_col_name]] <- output[output$Step == 'post' & output$username == user, ][[calc_col_name]] - output[output$Step == 'pre' & output$username == user, ][[calc_col_name]]
    print(output[output$username == user, ][[save_col_name]])
  }

  for (name in names(question_groups)) {
    output[[paste0(name, "_delta")]] <- NA
    lapply(1:length(users), calc_deltas, name)  
  }

  output[["total_delta"]] <- NA
  lapply(1:length(users), calc_deltas, 'total')  

  output
}

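The output in this case currently looks like this (the added delta columns are at the far right; their values are all NA, but they should be the computed values). Note that this is sample data and the numbers do not actually add up; see the example above for what the deltas should look like: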

+---------+-------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------+--------------+--------------+-------------+---------------+---------------+---------------+-------------+
| Subject | Stage | qGroupA.1 | qGroupA.2 | qGroupA.3 | qGroupB.1 | qGroupB.2 | qGroupB.3 | qGroupB.4 | qGroupC.1 | qGroupC.2 | qGroupC.3 | qGroupC.4 | qGroupC.5 | qGroupA_mean | qGroupB_mean | qGroupC_mean | total_mean  | qGroupA_delta | qGroupB_delta | qGroupC_delta | total_delta |
+---------+-------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------+--------------+--------------+-------------+---------------+---------------+---------------+-------------+
| A       | pre   | 0.666     | 0.333     | NA        | 1         | 0.666     | 0.666     | 1         | 1         | 1         | 1         | 0.666     | 1         | 0.4995       | 0.833        | 0.9332       | 0.897037394 | NA            | NA            | NA            | NA          |
| B       | pre   | NA        | 0.333     | 0.333     | 0.666     | 0.666     | 0.666     | 0.666     | 0.666     | 0.666     | 0.666     | 0.666     | 0.666     | 0.333        | 0.666        | 0.666        | 0.575108097 | NA            | NA            | NA            | NA          |
| A       | post  | 0.666     | 0.333     | 0.333     | 0.666     | 0.666     | 1         | 0.666     | 1         | 1         | 1         | 1         | 1         | 0.444        | 0.7495       | 1            | 0.897037394 | NA            | NA            | NA            | NA          |
| B       | post  | 0.666     | NA        | 0.333     | 1         | 0         | 1         | 1         | 1         | 0.666     | 0.666     | 1         | 1         | 0.4995       | 0.75         | 0.8664       | 0.575108097 | NA            | NA            | NA            | NA          |
+---------+-------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------+--------------+--------------+-------------+---------------+---------------+---------------+-------------+

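Why can't I add values to the data frame from inside the lapply function? Is there another, more direct way to do what I want?

2 answers:

Answer 0: (score: 1)

It is not clear to me exactly what you want, but here is an attempt using dplyr.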

df <- data.frame(
  "Subject" = c("A", "B", "C", "A", "B", "C"),
  "Stage" = c( "pre", "pre", "pre", "post", "post", "post"), 
  "Value" = c(.5, .2, .7, 1, .8, .3)
)

Using the dataset above, and only trying to populate the delta column, my approach would be:

library(dplyr)
library(reshape2)
df %>%
  dcast(., Subject ~ Stage, value.var = "Value") %>%
  mutate(delta = post-pre)

  Subject post pre delta
1       A  1.0 0.5   0.5
2       B  0.8 0.2   0.6
3       C  0.3 0.7  -0.4

However, if you really need the long format:

df %>%
  dcast(., Subject ~ Stage, value.var = "Value") %>%
  mutate(delta = post-pre) %>%
  melt(., id.vars = c('Subject','delta'), variable.name = "Stage", value.name = "Value")

  Subject delta Stage Value
1       A   0.5  post   1.0
2       B   0.6  post   0.8
3       C  -0.4  post   0.3
4       A   0.5   pre   0.5
5       B   0.6   pre   0.2
6       C  -0.4   pre   0.7

This approach works for any number of variables; you can add more columns with mutate.
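For the multi-column case in the question, a grouped mutate can compute every delta in one pass without reshaping. The sketch below is only an illustration under stated assumptions: it needs dplyr 1.0.0 or later for across, it uses the Subject and Stage columns plus two of the _mean columns from the question's sample table, and the names survey and survey_delta are made up here. The new columns come out named like qGroupA_mean_delta rather than qGroupA_delta.

```r
library(dplyr)

# Two of the *_mean columns from the question's sample table.
survey <- data.frame(
  Subject      = c("A", "B", "A", "B"),
  Stage        = c("pre", "pre", "post", "post"),
  qGroupA_mean = c(0.4995, 0.333, 0.444, 0.4995),
  qGroupB_mean = c(0.833, 0.666, 0.7495, 0.75)
)

survey_delta <- survey %>%
  group_by(Subject) %>%
  # For every *_mean column, subtract the subject's pre value from its
  # post value; the single result is recycled to both of that subject's rows.
  mutate(across(ends_with("_mean"),
                ~ .x[Stage == "post"] - .x[Stage == "pre"],
                .names = "{.col}_delta")) %>%
  ungroup()

survey_delta
```

This assumes each subject has exactly one pre row and one post row; a subject with a missing or duplicated stage would need to be handled before the subtraction.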

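Answer 1: (score: 1)

The general approach to this kind of problem is the so-called split-apply-combine strategy. For example, you can use the base R functions split, lapply, and unsplit: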

> df.1 <- split(df, df$Subject)
> df.2 <- lapply(df.1, function(x)
+     within(x, delta <- diff(setNames(Value, Stage)[c('pre', 'post')])))
> df.3 <- unsplit(df.2, df$Subject)
> df.3
  Subject Stage Value delta
1       A   pre   0.5   0.5
2       B   pre   0.2   0.6
3       C   pre   0.7  -0.4
4       A  post   1.0   0.5
5       B  post   0.8   0.6
6       C  post   0.3  -0.4

If you like, you can combine these three steps into one.
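One way to fold the three steps into a single expression is sketched below. It rebuilds the df from the question so that it runs on its own, and it computes post minus pre to match the result the question asks for; the name df_delta is only illustrative.

```r
df <- data.frame(
  Subject = c("A", "B", "C", "A", "B", "C"),
  Stage   = c("pre", "pre", "pre", "post", "post", "post"),
  Value   = c(.5, .2, .7, 1, .8, .3)
)

# Split by subject, add the post - pre delta to each piece, and put the
# pieces back together in the original row order, all in one expression.
df_delta <- unsplit(
  lapply(split(df, df$Subject), function(x) {
    # A single number, recycled to every row of this subject's piece.
    x$delta <- x$Value[x$Stage == "post"] - x$Value[x$Stage == "pre"]
    x  # return the modified piece instead of assigning to df
  }),
  df$Subject
)

df_delta
```

Because the anonymous function returns the modified piece, nothing depends on changing a variable outside the function. An assignment made with <- inside an R function only affects a local copy, which is why the lapply call in the question left the delta column as NA.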