Question

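Given a very large longitudinal dataset with different groups, I need to create a flag that indicates the first change in a certain variable (code) between years (year), per group (id). The type of an observation within the same id and year just indicates different group members.

Sample data:

library(tidyverse)
sample <- tibble(id = rep(1:3, each=6),
                 year = rep(2010:2012, 3, each=2),
                 type = (rep(1:2, 9)),
                 code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","","klm","nop","nop"))

What I need is to flag the first change of code within a group, between years. Second changes do not matter. Missing codes ("") can be treated as NA, but in any case they should not affect flag. Here is the tibble above with the flag field as it should be:

# A tibble: 18 × 5
      id  year  type  code  flag
   <int> <int> <int> <chr> <dbl>
1      1  2010     1   abc     0
2      1  2010     2   abc     0
3      1  2011     1           0
4      1  2011     2           0
5      1  2012     1   xyz     1
6      1  2012     2   xyz     1
7      2  2010     1           0
8      2  2010     2           0
9      2  2011     1   lmn     0
10     2  2011     2           0
11     2  2012     1   efg     1
12     2  2012     2   efg     1
13     3  2010     1   def     0
14     3  2010     2   def     0
15     3  2011     1           1
16     3  2011     2   klm     1
17     3  2012     1   nop     1
18     3  2012     2   nop     1

I still have a looping mindset and am trying to use vectorized dplyr to do what I need. Any input would be greatly appreciated!

EDIT: Thanks for pointing out the importance of year. The ids are ordered by year, since the ordering matters here, and all types within each id and year need to have the same flag. So, in the edited row 15, the code is "", which would not warrant a change by itself, but since row 16 in the same year has a new code, both observations need their flag set to 1.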

Answer 1

We can use data.table:

library(data.table)
setDT(sample)[, flag :=0][code!="",  flag := {rl <- rleid(code)-1; cummax(rl*(rl < 2)) }, id]
sample
#    id year type code flag
# 1:  1 2010    1  abc    0
# 2:  1 2010    2  abc    0
# 3:  1 2011    1         0
# 4:  1 2011    2         0
# 5:  1 2012    1  xyz    1
# 6:  1 2012    2  xyz    1
# 7:  2 2010    1         0
# 8:  2 2010    2         0
# 9:  2 2011    1  lmn    0
#10:  2 2011    2         0
#11:  2 2012    1  efg    1
#12:  2 2012    2  efg    1
#13:  3 2010    1  def    0
#14:  3 2010    2  def    0
#15:  3 2011    1  klm    1
#16:  3 2011    2  klm    1
#17:  3 2012    1  nop    1
#18:  3 2012    2  nop    1

Update

If we also need to take 'year' into account:

setDT(sample)[, flag :=0][code!="",  flag := {rl <- rleid(code, year)-1
                   cummax(rl*(rl < 2)) }, id]
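To make the arithmetic in the braces easier to follow, here is a minimal base R sketch that walks through it for the non-blank codes of a single id. The helper run_id is a hypothetical stand-in for data.table's rleid on one vector, written with rle so the sketch has no package dependencies; it is an illustration, not the answer's actual code.

```r
# Hypothetical base R stand-in for data.table::rleid() on a single vector:
# consecutive equal values share one run number, starting at 1.
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

# Non-blank codes of id 3 in the answer's sample data
codes <- c("def", "def", "klm", "klm", "nop", "nop")

rl         <- run_id(codes) - 1   # 0 0 1 1 2 2: number of changes so far
first_only <- rl * (rl < 2)       # 0 0 1 1 0 0: keeps only the first change
flag       <- cummax(first_only)  # 0 0 1 1 1 1: stays at 1 once a change occurred
print(flag)
```

Multiplying by (rl < 2) zeroes out every run after the first change, and cummax then carries the 1 forward so that later rows remain flagged.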

Answer 2

A possible solution using dplyr. Not sure it is the cleanest way, though:

sample %>% 
  group_by(id) %>% 
  #find first year per group where code exists
  mutate(first_year = min(year[code != ""])) %>% 
  #gather all codes from first year (does not assume code is constant within year)
  mutate(first_codes = list(code[year==first_year])) %>% 
  #if year is not first year & code not in first year codes & code not blank
  mutate(flag = as.numeric(year != first_year & !(code %in% unlist(first_codes)) & code != "")) %>% 
  #drop created columns
  select(-first_year, -first_codes) %>% 
  ungroup()

Output:

# A tibble: 18 × 5
      id  year  type  code  flag
   <int> <int> <int> <chr> <dbl>
1      1  2010     1   abc     0
2      1  2010     2   abc     0
3      1  2011     1           0
4      1  2011     2           0
5      1  2012     1   xyz     1
6      1  2012     2   xyz     1
7      2  2010     1           0
8      2  2010     2           0
9      2  2011     1   lmn     0
10     2  2011     2           0
11     2  2012     1   efg     1
12     2  2012     2   efg     1
13     3  2010     1   def     0
14     3  2010     2   def     0
15     3  2011     1   klm     1
16     3  2011     2   klm     1
17     3  2012     1   nop     1
18     3  2012     2   nop     1
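With the question's edited data, row 15 has a blank code, so a purely row-wise test leaves it at 0 even though the question's edit asks for every row of the same id and year to share one flag. One possible way to handle that is a second step that takes the maximum flag within each id and year. The sketch below is a hedged illustration in base R (so it runs without packages), not part of the answer above; the names dat, first_code and row_flag are hypothetical, and the row-level rule (a non-blank code that differs from the id's first non-blank code) is a simplifying assumption.

```r
# Edited sample data from the question (row 15 has a blank code)
dat <- data.frame(
  id   = rep(1:3, each = 6),
  year = rep(2010:2012, 3, each = 2),
  type = rep(1:2, 9),
  code = c("abc", "abc", "", "", "xyz", "xyz",
           "", "", "lmn", "", "efg", "efg",
           "def", "def", "", "klm", "nop", "nop"),
  stringsAsFactors = FALSE
)

# Step 1: first non-blank code per id, recycled to every row of that id
first_code <- ave(dat$code, dat$id, FUN = function(x) x[x != ""][1])

# Step 2: row-level flag; blank codes never count as a change
row_flag <- as.numeric(dat$code != "" & dat$code != first_code)

# Step 3: every row in the same id and year gets the maximum row-level flag
dat$flag <- ave(row_flag, dat$id, dat$year, FUN = max)

print(dat)
```

On this data, step 3 is what lifts row 15 from 0 to 1, because row 16 in the same id and year carries a new code. In a dplyr pipeline the same idea would presumably be a group_by(id, year) followed by mutate(flag = max(flag)).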

Answer 3

A short solution with the data.table package:

library(data.table)
setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code)-1 > 0), by = id]

Or:

setDT(samp)[, flag := 0][code!="", flag := 1*(code!=code[1] & code!=''), by = id][]

which gives the desired result:

> samp
    id year type code flag
 1:  1 2010    1  abc    0
 2:  1 2010    2  abc    0
 3:  1 2011    1         0
 4:  1 2011    2         0
 5:  1 2012    1  xyz    1
 6:  1 2012    2  xyz    1
 7:  2 2010    1         0
 8:  2 2010    2         0
 9:  2 2011    1  lmn    0
10:  2 2011    2         0
11:  2 2012    1  efg    1
12:  2 2012    2  efg    1
13:  3 2010    1  def    0
14:  3 2010    2  def    0
15:  3 2011    1  klm    1
16:  3 2011    2  klm    1
17:  3 2012    1  nop    1
18:  3 2012    2  nop    1
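The two variants agree on this sample, but they are not equivalent in general: they can differ when a code later reverts to the group's first value. A minimal base R sketch of that difference follows; run_id is a hypothetical stand-in for data.table's rleid on one vector, and the input vector is made up for illustration.

```r
# Hypothetical base R stand-in for data.table::rleid() on a single vector
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

# Made-up non-blank codes of one id, where the code reverts to its first value
codes <- c("abc", "xyz", "abc")

# Variant 1: any run after the first one is flagged
flag_run <- 1 * (run_id(codes) - 1 > 0)

# Variant 2: flagged only where the code differs from the first code
flag_first <- 1 * (codes != codes[1] & codes != "")

print(flag_run)    # 0 1 1
print(flag_first)  # 0 1 0
```

Which behavior is wanted depends on whether a return to the original code should still count as "changed".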

When year is also relevant:

setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code, year)-1 > 0), id]
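One thing to keep in mind with this variant: rleid over several columns starts a new run whenever any of them changes, so a new year alone increments the run id even if the code stays the same. The base R sketch below illustrates this on made-up values; run_id is a hypothetical stand-in for rleid, and pasting code and year into one key is an assumption used here to mimic the multi-column behavior.

```r
# Hypothetical base R stand-in for data.table::rleid() on a single vector
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

# Made-up rows of one id: the code never changes, but the year does
codes <- c("abc", "abc", "abc", "abc")
years <- c(2010, 2010, 2011, 2011)

# Runs over code alone: a single run, so nothing is flagged
flag_code <- 1 * (run_id(codes) - 1 > 0)

# Runs over the (code, year) combination: the new year starts a new run
flag_code_year <- 1 * (run_id(paste(codes, years)) - 1 > 0)

print(flag_code)       # 0 0 0 0
print(flag_code_year)  # 0 0 1 1
```

If only changes of code should be flagged, it may be worth checking the result against ids whose code is constant across years.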

A possible base R alternative:

f <- function(x) {
  x <- rle(x)$lengths
  1 * (rep(seq_along(x), times=x) - 1 > 0)
}

samp$flag <- 0
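# subset rows (note the comma); ave() returns character here, so convert back to numeric
samp$flag[samp$code!=''] <- as.numeric(with(samp[samp$code!='',], ave(as.character(code), id, FUN = f)))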

NOTE: It is better not to give an object the same name as a function (sample is also a base R function).

Used data:

samp <- data.frame(id = rep(1:3, each=6),
                   year = rep(2010:2012, 3, each=2),
                   type = (rep(1:2, 9)),
                   code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","klm","klm","nop","nop"))
