给定一个包含不同组的非常大的纵向数据集,我需要创建一个标志,指示每个组(code
)之间某个变量(year
)的第一个变化,每组({{ 1}})。同一年内id
的观察结果只表示不同的群体成员。
示例数据:
type
我需要的是在几年之间标记组内library(tidyverse)
sample <- tibble(id = rep(1:3, each=6),
year = rep(2010:2012, 3, each=2),
type = (rep(1:2, 9)),
code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","","klm","nop","nop"))
的第一次更改。第二个变化并不重要。缺失代码(code
)可视为""
,但无论如何不应影响NA
。以下是带有标志字段的上述元素,应该是:
flag
我仍然有一个循环的心态,我正在尝试使用矢量化的dplyr来做我需要的事情。 任何意见都将非常感谢!
编辑:感谢您指出# A tibble: 18 × 5
id year type code flag
<int> <int> <int> <chr> <dbl>
1 1 2010 1 abc 0
2 1 2010 2 abc 0
3 1 2011 1 0
4 1 2011 2 0
5 1 2012 1 xyz 1
6 1 2012 2 xyz 1
7 2 2010 1 0
8 2 2010 2 0
9 2 2011 1 lmn 0
10 2 2011 2 0
11 2 2012 1 efg 1
12 2 2012 2 efg 1
13 3 2010 1 def 0
14 3 2010 2 def 0
15 3 2011 1 1
16 3 2011 2 klm 1
17 3 2012 1 nop 1
18 3 2012 2 nop 1
的重要性。 ID按年排列,因为此处的排序非常重要,而且每year
每types
id
个year
需要具有相同的标记。因此,在编辑的第15行中,e代码为""
,这不能保证自身发生变化,但由于同一年第16行有一个新的code
,因此两个观察都需要将其代码更改为1。
答案 0 :(得分:3)
我们可以使用data.table
library(data.table)
setDT(sample)[, flag :=0][code!="", flag := {rl <- rleid(code)-1; cummax(rl*(rl < 2)) }, id]
sample
# id year type code flag
# 1: 1 2010 1 abc 0
# 2: 1 2010 2 abc 0
# 3: 1 2011 1 0
# 4: 1 2011 2 0
# 5: 1 2012 1 xyz 1
# 6: 1 2012 2 xyz 1
# 7: 2 2010 1 0
# 8: 2 2010 2 0
# 9: 2 2011 1 lmn 0
#10: 2 2011 2 0
#11: 2 2012 1 efg 1
#12: 2 2012 2 efg 1
#13: 3 2010 1 def 0
#14: 3 2010 2 def 0
#15: 3 2011 1 klm 1
#16: 3 2011 2 klm 1
#17: 3 2012 1 nop 1
#18: 3 2012 2 nop 1
如果我们还需要包括'年',
setDT(sample)[, flag :=0][code!="", flag := {rl <- rleid(code, year)-1
cummax(rl*(rl < 2)) }, id]
答案 1 :(得分:2)
使用dplyr
的可能解决方案。不确定它是最干净的方式
sample %>%
group_by(id) %>%
#find first year per group where code exists
mutate(first_year = min(year[code != ""])) %>%
#gather all codes from first year (does not assume code is constant within year)
mutate(first_codes = list(code[year==first_year])) %>%
#if year is not first year & code not in first year codes & code not blank
mutate(flag = as.numeric(year != first_year & !(code %in% unlist(first_codes)) & code != "")) %>%
#drop created columns
select(-first_year, -first_codes) %>%
ungroup()
输出
# A tibble: 18 × 5
id year type code flag
<int> <int> <int> <chr> <dbl>
1 1 2010 1 abc 0
2 1 2010 2 abc 0
3 1 2011 1 0
4 1 2011 2 0
5 1 2012 1 xyz 1
6 1 2012 2 xyz 1
7 2 2010 1 0
8 2 2010 2 0
9 2 2011 1 lmn 0
10 2 2011 2 0
11 2 2012 1 efg 1
12 2 2012 2 efg 1
13 3 2010 1 def 0
14 3 2010 2 def 0
15 3 2011 1 klm 1
16 3 2011 2 klm 1
17 3 2012 1 nop 1
18 3 2012 2 nop 1
答案 2 :(得分:2)
使用data.table
- 包的简短解决方案:
library(data.table)
setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code)-1 > 0), by = id]
或者:
setDT(samp)[, flag := 0][code!="", flag := 1*(code!=code[1] & code!=''), by = id][]
给出了期望的结果:
> samp
id year type code flag
1: 1 2010 1 abc 0
2: 1 2010 2 abc 0
3: 1 2011 1 0
4: 1 2011 2 0
5: 1 2012 1 xyz 1
6: 1 2012 2 xyz 1
7: 2 2010 1 0
8: 2 2010 2 0
9: 2 2011 1 lmn 0
10: 2 2011 2 0
11: 2 2012 1 efg 1
12: 2 2012 2 efg 1
13: 3 2010 1 def 0
14: 3 2010 2 def 0
15: 3 2011 1 klm 1
16: 3 2011 2 klm 1
17: 3 2012 1 nop 1
18: 3 2012 2 nop 1
当年份也相关时:
setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code, year)-1 > 0), id]
可能的基础R替代方案:
f <- function(x) {
x <- rle(x)$lengths
1 * (rep(seq_along(x), times=x) - 1 > 0)
}
samp$flag <- 0
samp$flag[samp$code!=''] <- with(samp[samp$code!=''], ave(as.character(code), id, FUN = f))
注意:最好不要让对象与函数同名。
使用过的数据:
samp <- data.frame(id = rep(1:3, each=6),
year = rep(2010:2012, 3, each=2),
type = (rep(1:2, 9)),
code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","klm","klm","nop","nop"))