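My dataset is large and contains many observations (dependent variable = DV) per individual (Name) across set periods (Period) of a testing session. A small example of my dataset is below: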
ExampleData <- data.frame(Name = c("Tom","Tom","Tom","Tom","Tom","Tom","Tom","Tom", "Tom", "Tom",
"Ben","Ben","Ben","Ben","Ben","Ben","Ben","Ben", "Ben", "Ben"),
Period = c(0,0,1,1,1,0,0,0,1,1,
0,0,0,1,1,1,0,0,1,1),
DV = runif(20, 1.5, 2.8))
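When ExampleData$Period == 1, an individual is performing an exercise test, which varies in time/length. The break between tests is indicated by ExampleData$Period == 0. To avoid manually marking when an individual is performing a test and numbering consecutive periods by hand, I would like to add a column that states when a run of 1s, separated by a run of 0s, is a new period, numbered separately within each individual's data. How can I do this?
My expected output is: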
ExampleData$Descriptor <- c(NA,NA,"Period One", "Period One","Period One",NA,NA,NA,"Period Two","Period Two",
NA,NA,NA,"Period One","Period One","Period One",NA,NA,"Period Two","Period Two")
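My question is similar to another question of mine, located here, but I now have multiple entries per individual. I have tried the following dplyr syntax: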
Test_df <- ExampleData %>%
mutate(
Descriptor = case_when(
Period > 0 ~ "Period",
Period == 0 ~ "Rest"),
rleid = cumsum(Descriptor != lag(Descriptor, 1, default = "NA")),
Descriptor = case_when(
Descriptor == "Period" ~ paste0(Descriptor, rleid %/% 2),
TRUE ~ "Rest"),
rleid = NULL
)
However, how do I account for each different Name/individual in the dataset?
Thank you.
Answer 0 (score: 5)
Here is another approach with dplyr:
library(dplyr)
ExampleData %>%
group_by(Name) %>%
mutate(Descriptor = with(rle(Period == 1),
rep(replace(paste("Period", cumsum(values)), !values, NA), lengths)))
# # A tibble: 20 x 4
# # Groups: Name [2]
# Name Period DV Descriptor
# <fctr> <dbl> <dbl> <chr>
# 1 Tom 0 2.641044 <NA>
# 2 Tom 0 2.692745 <NA>
# 3 Tom 1 1.515797 Period 1
# 4 Tom 1 2.601471 Period 1
# 5 Tom 1 1.669399 Period 1
# 6 Tom 0 2.700371 <NA>
# 7 Tom 0 1.993971 <NA>
# 8 Tom 0 2.203379 <NA>
# 9 Tom 1 2.488742 Period 2
# 10 Tom 1 1.596458 Period 2
# 11 Ben 0 2.578924 <NA>
# 12 Ben 0 1.916804 <NA>
# 13 Ben 0 2.676466 <NA>
# 14 Ben 1 2.508759 Period 1
# 15 Ben 1 2.447217 Period 1
# 16 Ben 1 2.728756 Period 1
# 17 Ben 0 2.326854 <NA>
# 18 Ben 0 1.748016 <NA>
# 19 Ben 1 1.703044 Period 2
# 20 Ben 1 1.783434 Period 2
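To see why the rle() expression above works, here is a minimal base R sketch of the same steps on a single individual's Period vector. The names period, runs, labels and descriptor are illustrative only and are not part of the answer; the values mirror Tom's rows.

```r
# Period values for one individual (same pattern as Tom's rows above)
period <- c(0, 0, 1, 1, 1, 0, 0, 0, 1, 1)

# Run-length encode the "is a test period" indicator
runs <- rle(period == 1)
runs$values   # one logical per run: rest, test, rest, test
runs$lengths  # how many rows each run covers

# Number the TRUE runs, blank out the FALSE runs,
# then expand the run-level labels back to row level
labels <- paste("Period", cumsum(runs$values))
labels[!runs$values] <- NA
descriptor <- rep(labels, runs$lengths)
descriptor
```

group_by(Name) simply makes mutate() repeat this per individual, which is why the numbering restarts at 1 for Ben.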
Answer 1 (score: 3)
Here is an option using data.table:
library(data.table)
setDT(ExampleData)[ , grp := rleid(Period == 1), .(Name)][Period == 1,
Descriptor := paste("Period", match(grp, unique(grp))), Name][, grp := NULL][]
# Name Period DV Descriptor
# 1: Tom 0 2.764916 NA
# 2: Tom 0 1.537837 NA
# 3: Tom 1 1.848110 Period 1
# 4: Tom 1 2.621724 Period 1
# 5: Tom 1 2.206875 Period 1
# 6: Tom 0 1.715299 NA
# 7: Tom 0 1.882378 NA
# 8: Tom 0 2.244155 NA
# 9: Tom 1 2.094944 Period 2
#10: Tom 1 1.713493 Period 2
#11: Ben 0 1.794261 NA
#12: Ben 0 1.608199 NA
#13: Ben 0 2.053490 NA
#14: Ben 1 1.791563 Period 1
#15: Ben 1 1.652090 Period 1
#16: Ben 1 2.510483 Period 1
#17: Ben 0 2.345984 NA
#18: Ben 0 2.754110 NA
#19: Ben 1 1.675527 Period 2
#20: Ben 1 1.709622 Period 2
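The renumbering step, match(grp, unique(grp)), may be the least obvious part. A small base R sketch of it follows; the grp values are typed in by hand as the run ids that rleid(Period == 1) would assign to Tom's rows, and the names active and period_no are illustrative only.

```r
# Run ids for Tom's rows, as rleid(Period == 1) would number them
grp    <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4)
period <- c(0, 0, 1, 1, 1, 0, 0, 0, 1, 1)

# Keep only the rows where a test is in progress
active <- grp[period == 1]

# Map each distinct run id to its order of first appearance,
# so the test runs become 1, 2, ... regardless of the rest runs in between
period_no <- match(active, unique(active))
period_no
```

Because the run ids of test periods are not consecutive (rest runs sit between them), matching against unique() is what turns them into a clean 1, 2, ... sequence.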
Answer 2 (score: 2)
A base R option:
unlist(with(ExampleData, tapply(Period, Name, function(x) c(0, cumsum(ifelse(diff(x) < 0, 0, diff(x)))) * x)))
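One thing to watch with the tapply() version: it returns the groups in factor-level order (alphabetical for character names, so Ben before Tom), which does not match the row order of ExampleData, and it returns period numbers rather than labels. A possible variant, sketched here with ave() so the result stays aligned with the original rows, is shown below. This is a swapped-in technique rather than the answer's own code; period_no and starts are illustrative names, and DV is left out because it is not needed.

```r
# Same Name/Period layout as the question's example data
ExampleData <- data.frame(
  Name   = rep(c("Tom", "Ben"), each = 10),
  Period = c(0, 0, 1, 1, 1, 0, 0, 0, 1, 1,
             0, 0, 0, 1, 1, 1, 0, 0, 1, 1)
)

# ave() applies FUN within each Name but returns values in original row order
period_no <- ave(ExampleData$Period, ExampleData$Name, FUN = function(x) {
  starts <- c(x[1], pmax(diff(x), 0))  # 1 wherever a run of 1s begins
  cumsum(starts) * x                   # 0 during rest, run number during a test
})

# Turn the numbers into labels, leaving rest rows as NA
ExampleData$Descriptor <- ifelse(period_no > 0, paste("Period", period_no), NA)
ExampleData
```

Using x[1] as the first element of starts also covers an individual whose data begins with Period == 1.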
Answer 3 (score: 0)
I was able to accomplish this by running the following:
Test_df <- ExampleData %>%
group_by(Name) %>%
mutate(
Descriptor = case_when(
Period > 0 ~ "Period",
Period == 0 ~ "Rest"),
rleid = cumsum(Descriptor != lag(Descriptor, 1, default = "NA")),
Descriptor = case_when(
Descriptor == "Period" ~ paste0(Descriptor, rleid %/% 2),
TRUE ~ "Rest"),
rleid = NULL
)
I have also used "Rest" instead of NA, as this more accurately depicts what is happening.
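A caveat worth checking against your own data: the rleid %/% 2 arithmetic relies on each individual starting with a rest run. If someone's data begins with Period == 1, the first test would be numbered 0. The base R sketch below illustrates this on a made-up vector and shows one possible alternative that counts the starts of test runs instead; run_id, prev and period_no are illustrative names, not part of the answer.

```r
# Hypothetical individual whose data starts with a test period
period <- c(1, 1, 0, 0, 1)

# Run ids, built the same way as the cumsum(... != lag(...)) step above
run_id <- cumsum(c(TRUE, diff(period) != 0))
run_id %/% 2   # the first test run gets 0 here

# Alternative: count the rows where a test run begins
prev <- c(0, head(period, -1))                # previous row's value, 0 before the first row
period_no <- cumsum(period == 1 & prev == 0)  # increases only at the start of a test run
period_no[period == 1]                        # numbers for the test rows only
```

Inside the grouped mutate(), the equivalent would presumably be cumsum(Period > 0 & lag(Period, default = 0) == 0), though I have not run that against the full dataset.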