I have a data frame that looks like this:
df <- data.frame(
"Subject" = c("A", "B", "C", "A", "B", "C"),
"Stage" = c( "pre", "pre", "pre", "post", "post", "post"),
"Value" = c(.5, .2, .7, 1, .8, .3)
)
I want to find the delta between the pre and post values for each subject, for example:
delta <- df[df$Stage == 'post' & df$Subject == 'A', ]$Value - df[df$Stage == 'pre' & df$Subject == 'A', ]$Value
and then add it to the data frame in its own column:
df[["delta"]] <- NA
df[df$Subject == 'A', ][["delta"]] <- delta
The above works, but when I try to do this for all rows using this approach:
subjects <- unique(df$Subject)
df[["delta"]] <- NA
calc_deltas <- function(position) {
  subject <- subjects[position]
  delta <- df[df$Stage == 'post' & df$Subject == subject, ]$Value - df[df$Stage == 'pre' & df$Subject == subject, ]$Value
  df[df$Subject == subject, ][["delta"]] <- delta
}
lapply(1:length(subjects), calc_deltas)
I expect the result to look like this:
Subject Stage Value delta
1 A pre 0.5 0.5
2 B pre 0.2 0.6
3 C pre 0.7 -0.4
4 A post 1.0 0.5
5 B post 0.8 0.6
6 C post 0.3 -0.4
But instead it comes out like this:
Subject Stage Value delta
1 A pre 0.5 NA
2 B pre 0.2 NA
3 C pre 0.7 NA
4 A post 1.0 NA
5 B post 0.8 NA
6 C post 0.3 NA
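The rows keep their NA values.
My problem is more complicated, because I want to do this for several groups of observations in each row, and I want to add the columns dynamically (I want to use one script to process data sets that are essentially the same but have different groups). This is the function from my script: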
find_deltas <- function(pre_post_survey_data, question_groups) {
  output <- pre_post_survey_data
  users <- unique(output$username)
  calc_deltas <- function (position, name) {
    user <- users[position]
    calc_col_name <- paste0(name, "_mean")
    save_col_name <- paste0(name, "_delta")
    output[output$username == user, ][[save_col_name]] <- output[output$Step == 'post' & output$username == user, ][[calc_col_name]] - output[output$Step == 'pre' & output$username == user, ][[calc_col_name]]
    print(output[output$username == user, ][[save_col_name]])
  }
  for (name in names(question_groups)) {
    output[[paste0(name, "_delta")]] <- NA
    lapply(1:length(users), calc_deltas, name)
  }
  output[["total_delta"]] <- NA
  lapply(1:length(users), calc_deltas, 'total')
  output
}
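The output in this case currently looks like this (the added delta columns are at the far right, and their values are all NA when they should be the calculated values). This is example data, so the numbers do not actually add up as they should; see the example above for reference: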
+---------+-------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------+--------------+--------------+-------------+---------------+---------------+---------------+-------------+
| Subject | Stage | qGroupA.1 | qGroupA.2 | qGroupA.3 | qGroupB.1 | qGroupB.2 | qGroupB.3 | qGroupB.4 | qGroupC.1 | qGroupC.2 | qGroupC.3 | qGroupC.4 | qGroupC.5 | qGroupA_mean | qGroupB_mean | qGroupC_mean | total_mean | qGroupA_delta | qGroupB_delta | qGroupC_delta | total_delta |
+---------+-------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------+--------------+--------------+-------------+---------------+---------------+---------------+-------------+
| A | pre | 0.666 | 0.333 | NA | 1 | 0.666 | 0.666 | 1 | 1 | 1 | 1 | 0.666 | 1 | 0.4995 | 0.833 | 0.9332 | 0.897037394 | NA | NA | NA | NA |
| B | pre | NA | 0.333 | 0.333 | 0.666 | 0.666 | 0.666 | 0.666 | 0.666 | 0.666 | 0.666 | 0.666 | 0.666 | 0.333 | 0.666 | 0.666 | 0.575108097 | NA | NA | NA | NA |
| A | post | 0.666 | 0.333 | 0.333 | 0.666 | 0.666 | 1 | 0.666 | 1 | 1 | 1 | 1 | 1 | 0.444 | 0.7495 | 1 | 0.897037394 | NA | NA | NA | NA |
| B | post | 0.666 | NA | 0.333 | 1 | 0 | 1 | 1 | 1 | 0.666 | 0.666 | 1 | 1 | 0.4995 | 0.75 | 0.8664 | 0.575108097 | NA | NA | NA | NA |
+---------+-------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------+--------------+--------------+-------------+---------------+---------------+---------------+-------------+
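Why can't I add values to the data frame from inside the lapply function? Is there another, more direct way to do what I want?
Answer 0 (score: 1)
I'm not entirely clear on what you want, but here is an attempt using dplyr.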
df <- data.frame(
"Subject" = c("A", "B", "C", "A", "B", "C"),
"Stage" = c( "pre", "pre", "pre", "post", "post", "post"),
"Value" = c(.5, .2, .7, 1, .8, .3)
)
Using the data set above, and only trying to fill in the delta column, my approach is:
library(dplyr)
library(reshape2)
df %>%
dcast(., Subject ~ Stage, value.var = "Value") %>%
mutate(delta = post-pre)
Subject post pre delta
1 A 1.0 0.5 0.5
2 B 0.8 0.2 0.6
3 C 0.3 0.7 -0.4
However, if you really need the long format:
df %>%
dcast(., Subject ~ Stage, value.var = "Value") %>%
mutate(delta = post-pre) %>%
melt(., id.vars = c('Subject','delta'), variable.name = "Stage", value.name = "Value")
Subject delta Stage Value
1 A 0.5 post 1.0
2 B 0.6 post 0.8
3 C -0.4 post 0.3
4 A 0.5 pre 0.5
5 B 0.6 pre 0.2
6 C -0.4 pre 0.7
This approach works for any number of variables, and you can add more columns with mutate.
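If the long format is the goal, the round trip through the wide format can be skipped. The following is only a sketch, not part of the original answer: it uses dplyr's group_by and mutate, and it assumes every subject has exactly one pre row and exactly one post row. The name df_long is made up for illustration.

```r
library(dplyr)

df <- data.frame(
  Subject = c("A", "B", "C", "A", "B", "C"),
  Stage   = c("pre", "pre", "pre", "post", "post", "post"),
  Value   = c(.5, .2, .7, 1, .8, .3)
)

# Assumption: exactly one "pre" and one "post" row per subject.
# Within each subject, the single post-minus-pre difference is
# recycled onto both of that subject's rows.
df_long <- df %>%
  group_by(Subject) %>%
  mutate(delta = Value[Stage == "post"] - Value[Stage == "pre"]) %>%
  ungroup()

df_long
```

The rows stay in their original order, so the delta column lines up with the layout requested in the question.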
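Answer 1 (score: 1)
The general approach to this kind of problem is the so-called split-apply-combine strategy. For example, you can use the base R functions split, lapply, and unsplit: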
> df.1 <- split(df, df$Subject)
> df.2 <- lapply(df.1, function(x)
+ within(x, delta <- diff(setNames(Value, Stage)[c('pre', 'post')])))
> df.3 <- unsplit(df.2, df$Subject)
> df.3
Subject Stage Value delta
1 A pre 0.5 0.5
2 B pre 0.2 0.6
3 C pre 0.7 -0.4
4 A post 1.0 0.5
5 B post 0.8 0.6
6 C post 0.3 -0.4
If you like, you can combine these three steps into one.
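One way the combined version might look is sketched below. The helper name add_delta is made up for illustration, and the sketch assumes one pre row and one post row per subject. It also bears on the question of why the lapply attempt left NA values: a function called by lapply modifies its own local copy of df, and that copy is discarded when the function returns. Here the helper returns the modified rows instead, and the combined result is assigned back to df.

```r
df <- data.frame(
  Subject = c("A", "B", "C", "A", "B", "C"),
  Stage   = c("pre", "pre", "pre", "post", "post", "post"),
  Value   = c(.5, .2, .7, 1, .8, .3)
)

# Hypothetical helper: takes the rows of one subject and returns them
# with a delta column (post minus pre) added. It returns the modified
# rows rather than trying to assign into the outer df.
add_delta <- function(x) {
  x$delta <- x$Value[x$Stage == "post"] - x$Value[x$Stage == "pre"]
  x
}

# split, lapply and unsplit in a single expression;
# the result must be assigned back for the change to persist.
df <- unsplit(lapply(split(df, df$Subject), add_delta), df$Subject)

df
```

For several columns, as in the find_deltas function from the question, the same pattern could be repeated inside the helper for each column name.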