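Column duplicates in dyad-year data

Time: 2016-09-07 16:19:03

Tags: r function dataframe dplyr

As the title suggests, I have dyad-year data. The problem is that (for some reason...) I have duplicated dyad members: for example, as shown below, observations from A to A and from B to B make no sense. The real data has more than 70,000 observations.

What I want to do is generate a dummy variable that flags these same-dyad observations.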

PERSON1     PERSON2      year     
   A           A          1990    
   A           A          1991    
   A           A          1992    
   A           B          1990    
   A           B          1991    
   A           B          1992   
   A           C          1990   
   A           C          1991   
   A           C          1992    
   B           B          1990    
   B           B          1991    
   B           B          1992    
   ...

The function duplicated(), along with other base R commands, does not help because this is dyadic data.
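A minimal sketch of why duplicated() misses these rows, assuming a toy data frame named df (hypothetical, built here with expand.grid) in the same PERSON1 / PERSON2 / year layout: duplicated() looks for rows that repeat an earlier row, not for rows whose two person columns hold the same value.

```r
# Toy dyad-year data in the same layout as the example (assumed, not the real data)
df <- expand.grid(year = 1990:1992,
                  PERSON2 = c("A", "B", "C"),
                  PERSON1 = c("A", "B", "C"),
                  stringsAsFactors = FALSE)
df <- df[, c("PERSON1", "PERSON2", "year")]

# Every (PERSON1, PERSON2, year) combination is unique, so no row is flagged
n_dup <- sum(duplicated(df))
print(n_dup)

# Restricting to the two person columns flags repeated dyads across years
# (e.g. A-B in 1991 and 1992), which is still not the same as flagging A-A or B-B
n_pair_dup <- sum(duplicated(df[, c("PERSON1", "PERSON2")]))
print(n_pair_dup)
```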

Here is a reproducible example:

df1 <- structure(list(PERSON1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "G"), class = "factor"), 
    PERSON2 = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 
    1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 
    3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), 
    year = c(1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 
    1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 
    1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 
    1991L, 1992L)), .Names = c("PERSON1", "PERSON2", "year"), class = "data.frame", row.names = c(NA, 
-27L))

Desired output (duplicate dummy):

PERSON1 PERSON2 year    duplicate
A          A    1990    1
A          A    1991    1
A          A    1992    1
A          B    1990    0
A          B    1991    0
A          B    1992    0
A          C    1990    0
A          C    1991    0
A          C    1992    0
B          A    1990    0
B          A    1991    0
B          A    1992    0
B          B    1990    1
B          B    1991    1
B          B    1992    1

1 answer:

Answer 0 (score: 0)

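We can do this easily by comparing 'PERSON1' and 'PERSON2':

library(data.table)
# compare the labels as character strings, then convert TRUE/FALSE to 1/0
setDT(df1)[, duplicate := as.integer(as.character(PERSON1) == as.character(PERSON2))]
head(df1, 15)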
#    PERSON1 PERSON2 year duplicate
# 1:       A       A 1990         1
# 2:       A       A 1991         1
# 3:       A       A 1992         1
# 4:       A       B 1990         0
# 5:       A       B 1991         0
# 6:       A       B 1992         0
# 7:       A       C 1990         0
# 8:       A       C 1991         0
# 9:       A       C 1992         0
#10:       B       A 1990         0
#11:       B       A 1991         0
#12:       B       A 1992         0
#13:       B       B 1990         1
#14:       B       B 1991         1
#15:       B       B 1992         1
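The as.character() calls matter when the two columns are factors. A minimal sketch, using two toy factors p1 and p2 (hypothetical names, not part of the question's data), of what happens when factors with different level sets are compared directly:

```r
# Toy factors (hypothetical) whose level sets differ: A/B/G versus A/B/C
p1 <- factor(c("A", "B", "G"))
p2 <- factor(c("A", "C", "B"), levels = c("A", "B", "C"))

# Direct comparison stops with an error because the level sets differ;
# tryCatch() captures the message instead of aborting
res <- tryCatch(p1 == p2, error = function(e) conditionMessage(e))
print(res)

# Comparing the labels as character strings works regardless of the levels
same <- as.integer(as.character(p1) == as.character(p2))
print(same)
```

If both columns are already character vectors, or factors sharing the same level set, the conversion is harmless, so keeping it is the safer default.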

Or using base R:

transform(df1, duplicate = as.integer(as.character(PERSON1) == as.character(PERSON2)))
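Since the question is tagged dplyr, the same comparison can also be sketched with mutate(). This is not part of the original answer; it assumes the dplyr package is installed, and the small df1 below is a hypothetical stand-in for the real data.

```r
library(dplyr)

# Small stand-in for df1 (hypothetical rows in the same layout)
df1 <- data.frame(
  PERSON1 = factor(c("A", "A", "B", "B")),
  PERSON2 = factor(c("A", "B", "A", "B")),
  year    = c(1990L, 1990L, 1990L, 1990L)
)

# Add the dummy: 1 when both dyad members are the same person, 0 otherwise
out <- df1 %>%
  mutate(duplicate = as.integer(as.character(PERSON1) == as.character(PERSON2)))

print(out)
```

Unlike the setDT() version, which modifies df1 by reference, mutate() and transform() return a new data frame, so the result has to be assigned to keep it.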