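As the title suggests, I have dyad-year data. The problem is that (for some reason...) some dyads pair a person with themselves: as shown below, the A-to-A and B-to-B observations are meaningless. The real data has more than 70,000 observations.
What I want to do is generate a dummy variable that flags these same-person dyad observations.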
PERSON1 PERSON2 year
A A 1990
A A 1991
A A 1992
A B 1990
A B 1991
A B 1992
A C 1990
A C 1991
A C 1992
B B 1990
B B 1991
B B 1992
...
The function duplicated(), together with other base R commands, did not help, because this is dyadic data.
Here is a reproducible example:
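To make the point about duplicated() concrete, here is a minimal sketch on a hypothetical three-row toy data frame (the name `toy` and its values are made up for illustration and are not part of the real data). duplicated() checks whether a whole row has appeared before, whereas the flag wanted here depends on whether the two person columns match within a row.

```r
# Hypothetical toy data that mirrors the layout above
toy <- data.frame(PERSON1 = c("A", "A", "B"),
                  PERSON2 = c("A", "B", "B"),
                  year    = c(1990L, 1990L, 1990L),
                  stringsAsFactors = FALSE)

# duplicated() asks: has this entire row occurred earlier in the data?
dup_rows <- duplicated(toy)

# A column-wise comparison asks: is PERSON1 the same as PERSON2 in this row?
self_dyad <- toy$PERSON1 == toy$PERSON2
```

No row in `toy` repeats, so `dup_rows` flags nothing, while `self_dyad` picks out the A-A and B-B rows.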
structure(list(PERSON1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "G"), class = "factor"),
PERSON2 = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
year = c(1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 1990L,
1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 1990L,
1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 1990L,
1991L, 1992L)), .Names = c("PERSON1", "PERSON2", "year"), class = "data.frame", row.names = c(NA,
-27L))
Desired output (duplicate dummy):
PERSON1 PERSON2 year duplicate
A A 1990 1
A A 1991 1
A A 1992 1
A B 1990 0
A B 1991 0
A B 1992 0
A C 1990 0
A C 1991 0
A C 1992 0
B A 1990 0
B A 1991 0
B A 1992 0
B B 1990 1
B B 1991 1
B B 1992 1
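Answer 0 (score: 0)
We can do this easily by comparing 'PERSON1' and 'PERSON2'. With data.table:
library(data.table)
# convert to data.table by reference, then add the 0/1 flag column
setDT(df1)[, duplicate := as.integer(as.character(PERSON1) == as.character(PERSON2))]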
head(df1, 15)
# PERSON1 PERSON2 year duplicate
# 1: A A 1990 1
# 2: A A 1991 1
# 3: A A 1992 1
# 4: A B 1990 0
# 5: A B 1991 0
# 6: A B 1992 0
# 7: A C 1990 0
# 8: A C 1991 0
# 9: A C 1992 0
#10: B A 1990 0
#11: B A 1991 0
#12: B A 1992 0
#13: B B 1990 1
#14: B B 1991 1
#15: B B 1992 1
Or using base R:
transform(df1, duplicate = as.integer(as.character(PERSON1) == as.character(PERSON2)))
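The as.character() calls matter because both columns are factors, and in the example data their level sets differ. As a minimal sketch (the vectors `p1` and `p2` are hypothetical stand-ins for the two columns), comparing such factors directly with == raises an error, while comparing their labels works:

```r
# Hypothetical stand-ins for PERSON1 and PERSON2, with different level sets
p1 <- factor(c("A", "B", "G"), levels = c("A", "B", "G"))
p2 <- factor(c("A", "B", "C"), levels = c("A", "B", "C"))

# Direct factor comparison errors when the level sets differ;
# capture the error message instead of stopping
direct <- tryCatch(p1 == p2, error = function(e) conditionMessage(e))

# Comparing the labels as character strings gives the 0/1 flag
flag <- as.integer(as.character(p1) == as.character(p2))
```

If the columns are already character vectors, the conversion is harmless, so keeping it makes the comparison work in either case.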