我的数据看起来像这样:
ID CSEX MID CMOB CYRB 1ST 2ND
1 1 1 1 1991 0 1
2 1 1 7 1989 1 0
3 2 2 1 1985 1 0
4 2 2 11 1985 0 1
5 1 2 9 1994 0 0
6 2 3 4 1992 1 0
7 2 4 2 1992 0 1
8 1 4 10 1983 1 0
ID =儿童ID,CSEX =儿童性别,MID =母亲ID,CMOB =出生月份和CYRB =出生年份,第1名=第一次出生假人,第2名=第二次出生假人。
如果出生在一个家庭(即同一个MID)中的前两个孩子是同一性别,我试图制作一个值为1的虚拟变量。
我试过
Identifiers_age <- Identifiers_age %>% group_by(MPUBID) %>%
mutate(samesex =
as.numeric(((first == 1 & CSEX == 1) & (second == 1 & CSEX == 1))
| (first == 1 & CSEX == 2) & (second == 1 & CSEX ==2))))
但显然这仍然只检查每个个体ID而不是MID的条件,因此返回一个总是取值= 0的虚拟。
由于
编辑预期输出:
ID CSEX MID CMOB CYRB 1ST 2ND SAMESEX
1 1 1 1 1991 0 1 1
2 1 1 7 1989 1 0 1
3 2 2 1 1985 1 0 1
4 2 2 11 1985 0 1 1
5 1 2 9 1994 0 0 1
6 2 3 4 1992 1 0 0
7 2 4 2 1992 0 1 0
8 1 4 10 1983 1 0 0
即。对于出生的前两个孩子属于同一性别的家庭中的任何个人,假人SAMESEX = 1
Edit2(我之前展示的只是我做的一个例子,对于真正的数据集调用结构给出的):
CPUBID MPUBID CSEX CMOB CYRB first second
<int> <int> <int> <int> <int> <dbl> <dbl>
1 201 2 2 3 1993 1 0
2 202 2 2 11 1994 0 1
3 301 3 2 6 1981 1 0
4 302 3 2 10 1983 0 1
5 303 3 2 4 1986 0 0
6 401 4 1 8 1980 1 0
7 403 4 2 3 1997 0 1
8 801 8 2 3 1976 1 0
9 802 8 1 5 1979 0 1
10 803 8 2 9 1982 0 0
和str:
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 11512 obs. of 7 variables:
$ CPUBID : int 201 202 301 302 303 401 403 801 802 803 ...
$ MPUBID : int 2 2 3 3 3 4 4 8 8 8 ...
$ CSEX : int 2 2 2 2 2 1 2 2 1 2 ...
$ CMOB : int 3 11 6 10 4 8 3 3 5 9 ...
$ CYRB : int 1993 1994 1981 1983 1986 1980 1997 1976 1979 1982 ...
$ first : num 1 0 1 0 0 1 0 1 0 0 ...
$ second : num 0 1 0 1 0 0 1 0 1 0 ...
答案 0 :(得分:2)
可能有帮助
library(dplyr)
Identifiers_age %>%
group_by(MID) %>%
mutate(ind1 = CSEX *`1ST`,
ind2 = CSEX *`2ND`,
SAMESEX = as.integer(n_distinct(c(ind1[ind1!=0],
ind2[ind2!=0]))==1 & sum(ind1) >0 & sum(ind2) > 0)) %>%
select(-ind1, -ind2)
# ID CSEX MID CMOB CYRB 1ST 2ND SAMESEX
# <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 1 1 1 1991 0 1 1
#2 2 1 1 7 1989 1 0 1
#3 3 2 2 1 1985 1 0 1
#4 4 2 2 11 1985 0 1 1
#5 5 1 2 9 1994 0 0 1
#6 6 2 3 4 1992 1 0 0
#7 7 2 4 2 1992 0 1 0
#8 8 1 4 10 1983 1 0 0
或者可以使用
进行略微紧凑Identifiers_age %>%
group_by(MID) %>%
mutate(SAMESEX = as.integer(n_distinct(c(CSEX * NA^!`1ST`, CSEX * NA^!`2ND`),
na.rm = TRUE)==1 & sum(`1ST`) > 0 & sum(`2ND`) > 0))
Identifiers_age <- structure(list(ID = 1:8, CSEX = c(1L, 1L, 2L, 2L, 1L,
2L, 2L,
1L), MID = c(1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L), CMOB = c(1L, 7L,
1L, 11L, 9L, 4L, 2L, 10L), CYRB = c(1991L, 1989L, 1985L, 1985L,
1994L, 1992L, 1992L, 1983L), `1ST` = c(0L, 1L, 1L, 0L, 0L, 1L,
0L, 1L), `2ND` = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L)), .Names = c("ID",
"CSEX", "MID", "CMOB", "CYRB", "1ST", "2ND"), class = "data.frame",
row.names = c(NA, -8L))