Add an ID based on groups of zeros

Date: 2017-09-08 06:49:37

Tags: r

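My dataset is large and contains many observations (dependent variable = DV) for individuals (Name) over set periods (Period) of a testing session. A small example of my dataset is below: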

ExampleData <- data.frame(Name = c("Tom","Tom","Tom","Tom","Tom","Tom","Tom","Tom", "Tom", "Tom", 
                                   "Ben","Ben","Ben","Ben","Ben","Ben","Ben","Ben", "Ben", "Ben"),
                          Period = c(0,0,1,1,1,0,0,0,1,1, 
                                      0,0,0,1,1,1,0,0,1,1),
                          DV = runif(20, 1.5, 2.8))

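When ExampleData$Period==1, an individual is undertaking an exercise test, which can vary in time/length. The break between tests is denoted by ExampleData$Period==0. To avoid manually entering when an individual undertakes a test and numbering the periods by hand, I wish to include a column that declares when a group of 1s separated by a group of 0s is a new period, restarting the count for each individual. How would I go about this?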

My expected output is:

ExampleData$Descriptor <- c(NA,NA,"Period One", "Period One","Period One",NA,NA,NA,"Period Two","Period Two",
                        NA,NA,NA,"Period One","Period One","Period One",NA,NA,"Period Two","Period Two")

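My question is similar to another question of mine, located here, but I now have multiple entries for each individual. I have tried the following dplyr syntax: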

Test_df <- ExampleData %>%
  mutate(
    Descriptor = case_when(
      Period > 0 ~ "Period",
      Period == 0 ~ "Rest"),
    rleid = cumsum(Descriptor != lag(Descriptor, 1, default = "NA")), 
    Descriptor = case_when(
      Descriptor == "Period" ~ paste0(Descriptor, rleid %/% 2),
      TRUE ~ "Rest"),
    rleid = NULL
  )

However, how do I account for each different name/individual in the dataset?

Thank you.

4 Answers:

Answer 0: (score: 5)

Here is another approach using dplyr:

library(dplyr)

ExampleData %>% 
  group_by(Name) %>% 
  mutate(Descriptor = with(rle(Period == 1), 
             rep(replace(paste("Period", cumsum(values)), !values, NA), lengths)))

# # A tibble: 20 x 4
# # Groups:   Name [2]
# Name Period       DV Descriptor
# <fctr>  <dbl>    <dbl>      <chr>
#   1    Tom      0 2.641044       <NA>
#   2    Tom      0 2.692745       <NA>
#   3    Tom      1 1.515797   Period 1
#   4    Tom      1 2.601471   Period 1
#   5    Tom      1 1.669399   Period 1
#   6    Tom      0 2.700371       <NA>
#   7    Tom      0 1.993971       <NA>
#   8    Tom      0 2.203379       <NA>
#   9    Tom      1 2.488742   Period 2
#  10    Tom      1 1.596458   Period 2
#  11    Ben      0 2.578924       <NA>
#  12    Ben      0 1.916804       <NA>
#  13    Ben      0 2.676466       <NA>
#  14    Ben      1 2.508759   Period 1
#  15    Ben      1 2.447217   Period 1
#  16    Ben      1 2.728756   Period 1
#  17    Ben      0 2.326854       <NA>
#  18    Ben      0 1.748016       <NA>
#  19    Ben      1 1.703044   Period 2
#  20    Ben      1 1.783434   Period 2
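
To make the rle() step in the answer above easier to follow, here is a minimal base R sketch that applies the same idea to a single person's Period vector (Tom's values from the question). The variable names period, runs, labels and descriptor are illustrative only and do not appear in the original answer.

```r
# One person's Period vector, taken from the question's example (Tom)
period <- c(0, 0, 1, 1, 1, 0, 0, 0, 1, 1)

# Run-length encode the condition: one entry per consecutive run
runs <- rle(period == 1)

# cumsum() over the run values counts how many runs of 1s have started so far
labels <- paste("Period", cumsum(runs$values))

# Runs of 0s (rest) get NA instead of a period label
labels[!runs$values] <- NA

# Expand the per-run labels back to one label per observation
descriptor <- rep(labels, runs$lengths)
print(descriptor)
```

Wrapping the same steps in group_by(Name) %>% mutate(...), as the answer does, restarts the numbering for each individual.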

Answer 1: (score: 3)

Here is an option using data.table:

library(data.table)
setDT(ExampleData)[ , grp := rleid(Period == 1), .(Name)][Period == 1, 
    Descriptor := paste("Period", match(grp, unique(grp))), Name][, grp := NULL][]
#     Name Period       DV Descriptor
# 1:  Tom      0 2.764916         NA
# 2:  Tom      0 1.537837         NA
# 3:  Tom      1 1.848110   Period 1
# 4:  Tom      1 2.621724   Period 1
# 5:  Tom      1 2.206875   Period 1
# 6:  Tom      0 1.715299         NA
# 7:  Tom      0 1.882378         NA
# 8:  Tom      0 2.244155         NA
# 9:  Tom      1 2.094944   Period 2
#10:  Tom      1 1.713493   Period 2
#11:  Ben      0 1.794261         NA
#12:  Ben      0 1.608199         NA
#13:  Ben      0 2.053490         NA
#14:  Ben      1 1.791563   Period 1
#15:  Ben      1 1.652090   Period 1
#16:  Ben      1 2.510483   Period 1
#17:  Ben      0 2.345984         NA
#18:  Ben      0 2.754110         NA
#19:  Ben      1 1.675527   Period 2
#20:  Ben      1 1.709622   Period 2

Answer 2: (score: 2)

Base R option:

unlist(with(ExampleData, tapply(Period, Name, function(x) c(0, cumsum(ifelse(diff(x) < 0, 0, diff(x)))) * x)))
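
One thing to watch with the tapply() one-liner: its result is ordered by the sorted group names (Ben before Tom), not by the rows of the data frame, so assigning it straight into a new column could misalign the values. A possible alternative, sketched below under that assumption, uses ave(), which returns values in the original row order. The helper number_periods and the column name PeriodID are hypothetical names introduced for this illustration.

```r
# Same Name/Period layout as the question's example (DV omitted for brevity)
ExampleData <- data.frame(
  Name   = rep(c("Tom", "Ben"), each = 10),
  Period = c(0, 0, 1, 1, 1, 0, 0, 0, 1, 1,
             0, 0, 0, 1, 1, 1, 0, 0, 1, 1)
)

# Hypothetical helper: number each run of 1s within one person's vector
number_periods <- function(x) {
  # TRUE where a run of 1s begins; the value before the first row is treated as 0
  starts <- diff(c(0, x)) == 1
  # Count the runs started so far; rest rows (0) become NA
  ifelse(x == 1, cumsum(starts), NA)
}

# ave() applies the helper per Name and keeps the original row order
ExampleData$PeriodID <- ave(ExampleData$Period, ExampleData$Name,
                            FUN = number_periods)

# Build the text label asked for in the question
ExampleData$Descriptor <- ifelse(is.na(ExampleData$PeriodID), NA,
                                 paste("Period", ExampleData$PeriodID))
print(ExampleData)
```

Because the helper prepends a 0 before taking diff(), a person whose data starts with Period == 1 is still numbered from 1.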

Answer 3: (score: 0)

I was able to do this successfully by running the following, which adds group_by(Name) to my original attempt:

library(dplyr)

Test_df <- ExampleData %>%
  group_by(Name) %>%
  mutate(
    Descriptor = case_when(
      Period > 0 ~ "Period",
      Period == 0 ~ "Rest"),
    rleid = cumsum(Descriptor != lag(Descriptor, 1, default = "NA")), 
    Descriptor = case_when(
      Descriptor == "Period" ~ paste0(Descriptor, rleid %/% 2),
      TRUE ~ "Rest"),
    rleid = NULL
  )

I have also used "Rest" instead of NA, as this more accurately portrays what is occurring.
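
Note that the rleid %/% 2 trick relies on every individual starting with a rest row: if a person's data began with Period == 1, their first period would be labelled "Period0". A minimal sketch of a variant that does not depend on the starting state is below. It assumes dplyr is installed; the small data frame and the temporary column new_period are made up for this illustration.

```r
library(dplyr)

# Illustrative data: Tom starts mid-test, Ben starts with a rest
ExampleData <- data.frame(
  Name   = rep(c("Tom", "Ben"), each = 5),
  Period = c(1, 1, 0, 1, 1,
             0, 1, 0, 0, 1)
)

Test_df <- ExampleData %>%
  group_by(Name) %>%
  mutate(
    # TRUE on the first row of each run of test rows within a person
    new_period = Period > 0 & lag(Period, default = 0) == 0,
    # Count the runs started so far; rest rows are labelled "Rest"
    Descriptor = if_else(Period > 0,
                         paste0("Period", cumsum(new_period)),
                         "Rest"),
    new_period = NULL
  ) %>%
  ungroup()

print(Test_df)
```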