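I am currently working in R. I have three columns that I need to use to identify duplicates.
This is the data frame I am working with: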
df1 <- data.frame(ID_NUMBER = c(990,50000,52000,764000,764000,764000,1420000,1420000,1470000,1470000,2176000,2176000,2401000,2401000,2667000,2667000,3519000,3721000,3721000,4654000,4654000,4685000),
CalNumber = c(0,1126.61,1152.24,26900.12,26900.2,26910,50673.98,50674.31,52161.18,52161.73,77743.17,77743.7,85593.97,85594.42,94854.76,94855,124033.46,130973.56,130973.59,162935.73,162936.85,163446.89),
Date = c('8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013'))
ID_NUMBER  CalNumber  Date
990        0          8/8/2013
50000      1126.61    8/16/2008
52000      1152.24    8/8/2013
764000     26900.12   8/8/2013
764000     26900.2    8/16/2008
764000     26910      8/16/2008
1420000    50673.98   8/16/2008
1420000    50674.31   8/8/2013
1470000    52161.18   8/16/2008
1470000    52161.73   8/8/2013
2176000    77743.17   8/16/2008
2176000    77743.7    8/8/2013
2401000    85593.97   8/16/2008
2401000    85594.42   8/8/2013
2667000    94854.76   8/16/2008
2667000    94855      8/8/2013
3519000    124033.46  8/8/2013
3721000    130973.56  8/8/2013
3721000    130973.59  8/16/2008
4654000    162935.73  8/16/2008
4654000    162936.85  8/8/2013
4685000    163446.89  8/8/2013
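Duplicates are identified as follows. If an ID_NUMBER is not unique, subtract each record's CalNumber from that of the next record within the same ID_NUMBER group. If the delta between a record and the next one is less than or equal to 1, treat the two as duplicates. The preferred record is the one with the latest (max) date in the group. That record becomes the primary and the other is flagged as secondary.
My final result set will have two new flags: isNew and isPrimary. If no duplicate exists, the record is considered a new, first-time record. So for non-duplicated records, isNew will be "Y" and isPrimary will be "Y". I hope the example result below explains what I mean. I am new to R, so I don't know where to start. Any suggestions or pointers would be greatly appreciated.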
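To make the delta rule concrete, here is a minimal base R sketch for a single group (ID_NUMBER 764000). The variable names `cal`, `delta`, and `is_dup_of_prev` are made up for this illustration and are not part of the original data frame.

```r
# CalNumber values for the ID_NUMBER 764000 group, in row order
cal <- c(26900.12, 26900.2, 26910)

# Difference between each record and the one before it,
# rounded to two decimals to hide floating-point noise
delta <- round(diff(cal), 2)
delta

# A record counts as a duplicate of the previous one when the delta is <= 1
is_dup_of_prev <- delta <= 1
is_dup_of_prev
```

Under this reading, the first two rows form a duplicate pair and the third row (delta of 9.8) is treated as a new record, which matches the example result.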
ID_NUMBER  CalNumber  Date       CalcDiff  IsNew  isPrimary
990        0          8/8/2013   --        Y      Y
50000      1126.61    8/16/2008  --        Y      Y
52000      1152.24    8/8/2013   --        Y      Y
764000     26900.12   8/8/2013   --        N      Y
764000     26900.2    8/16/2008  .08       N      N
764000     26910      8/16/2008  9.8       Y      Y
1420000    50673.98   8/16/2008  --        N      N
1420000    50674.31   8/8/2013   .33       N      Y
1470000    52161.18   8/16/2008  --        N      N
1470000    52161.73   8/8/2013   .55       N      Y
2176000    77743.17   8/16/2008  --        N      Y
2176000    77743.7    8/8/2013   .53       N      N
2401000    85593.97   8/16/2008  --        N      N
2401000    85594.42   8/8/2013   .45       N      Y
2667000    94854.76   8/16/2008  --        N      N
2667000    94855      8/8/2013   .24       N      Y
3519000    124033.46  8/8/2013   --        Y      Y
3721000    130973.56  8/8/2013   --        N      Y
3721000    130973.59  8/16/2008  .03       N      N
4654000    162935.73  8/16/2008  --        Y      Y
4654000    162936.85  8/8/2013   1.12      Y      Y
4685000    163446.89  8/8/2013   --        Y      Y
Answer 0 (score: 2)
This solution requires dplyr and magrittr (for the compound assignment pipe). First, I define the data frame:
df <- data.frame(ID_NUMBER = c(990,50000,52000,764000,764000,764000,1420000,1420000,1470000,1470000,2176000,2176000,2401000,2401000,2667000,2667000,3519000,3721000,3721000,4654000,4654000,4685000),
CalNumber = c(0,1126.61,1152.24,26900.12,26900.2,26910,50673.98,50674.31,52161.18,52161.73,77743.17,77743.7,85593.97,85594.42,94854.76,94855,124033.46,130973.56,130973.59,162935.73,162936.85,163446.89),
Date = c('8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013'))
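Here I convert your Date column to an actual date. Then I group by ID_NUMBER and compute the difference between adjacent rows. Next, I use case_when to apply your conditions and determine IsNew. Finally, I group again by ID_NUMBER and IsNew and check for the most recent date.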
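As a side note on the first step, the following base R sketch shows why the conversion matters before taking the most recent date. The names `dates`, `is_primary`, `strings`, `latest_as_text`, and `latest_as_date` are illustrative only, and the value "9/1/2008" is a made-up date that does not occur in the data.

```r
# The two dates of one duplicate pair, parsed as month/day/year
dates <- as.Date(c("8/16/2008", "8/8/2013"), format = "%m/%d/%Y")

# The record carrying the latest date is the primary one
is_primary <- ifelse(dates == max(dates), "Y", "N")
is_primary

# Hypothetical pair: compared as text, the 2008 entry would win,
# because "9" sorts after "8" character by character
strings <- c("9/1/2008", "8/8/2013")
latest_as_text <- max(strings)
latest_as_date <- max(as.Date(strings, format = "%m/%d/%Y"))
```

The code that follows makes the same kind of max(Date) comparison, but within each group.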
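library(dplyr)
library(magrittr)  # provides the %<>% compound assignment pipe

df %<>%
  # Parse the month/day/year strings so that max() compares real dates
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(ID_NUMBER) %>%
  # Difference from the previous row within each ID_NUMBER; NA for the first row
  mutate(CalcDiff = c(NA, diff(CalNumber))) %>%
  mutate(IsNew = case_when(
    # First row of a group: judge it by the delta to the second row
    n() > 1 & is.na(CalcDiff) & lead(CalcDiff)[1] <= 1 ~ "N",
    n() > 1 & is.na(CalcDiff) & lead(CalcDiff)[1] > 1 ~ "Y",
    # Later rows: judge them by the delta to the previous row
    n() > 1 & CalcDiff <= 1 ~ "N",
    n() > 1 & CalcDiff > 1 ~ "Y",
    # Single-row groups are always new
    TRUE ~ "Y"
  )) %>%
  group_by(ID_NUMBER, IsNew) %>%
  mutate(IsPrimary = case_when(
    # Among duplicates, the most recent date is the primary record
    Date == max(Date) & IsNew == "N" ~ "Y",
    Date != max(Date) & IsNew == "N" ~ "N",
    TRUE ~ "Y"
  ))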
Result:
# A tibble: 22 x 6
# Groups:   ID_NUMBER, IsNew [14]
#    ID_NUMBER CalNumber Date       CalcDiff IsNew IsPrimary
#        <dbl>     <dbl> <date>        <dbl> <chr> <chr>
#  1       990        0  2013-08-08       NA Y     Y
#  2     50000     1127. 2008-08-16       NA Y     Y
#  3     52000     1152. 2013-08-08       NA Y     Y
#  4    764000    26900. 2013-08-08       NA N     Y
#  5    764000    26900. 2008-08-16     0.08 N     N
#  6    764000    26910  2008-08-16     9.80 Y     Y
#  7   1420000    50674. 2008-08-16       NA N     N
#  8   1420000    50674. 2013-08-08    0.330 N     Y
#  9   1470000    52161. 2008-08-16       NA N     N
# 10   1470000    52162. 2013-08-08     0.55 N     Y
# 11   2176000    77743. 2008-08-16       NA N     N
# 12   2176000    77744. 2013-08-08    0.530 N     Y
# 13   2401000    85594. 2008-08-16       NA N     N
# 14   2401000    85594. 2013-08-08    0.450 N     Y
# 15   2667000    94855. 2008-08-16       NA N     N
# 16   2667000    94855  2013-08-08     0.24 N     Y
# 17   3519000   124033. 2013-08-08       NA Y     Y
# 18   3721000   130974. 2013-08-08       NA N     Y
# 19   3721000   130974. 2008-08-16   0.0300 N     N
# 20   4654000   162936. 2008-08-16       NA Y     Y
# 21   4654000   162937. 2013-08-08     1.12 Y     Y
# 22   4685000   163447. 2013-08-08       NA Y     Y
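One detail of the result that may not be obvious is how the first row of each group (for example row 7) gets its IsNew flag even though its CalcDiff is NA. The sketch below reproduces that step for ID_NUMBER 1420000 in base R. Here `shifted` is a hand-written stand-in for what `lead(CalcDiff)` returns in dplyr, and all variable names are invented for this illustration.

```r
# One ID_NUMBER group with two records (ID_NUMBER 1420000)
cal_number <- c(50673.98, 50674.31)

# Same construction as in the pipeline: NA for the first row, then the deltas
calc_diff <- c(NA, diff(cal_number))

# Base R stand-in for dplyr::lead(): shift the vector up by one position
shifted <- c(calc_diff[-1], NA)

# The first element is the delta between rows 1 and 2, which is the value
# that lead(CalcDiff)[1] supplies when flagging the first row of the group
first_row_delta <- shifted[1]

is_new_first_row <- if (first_row_delta <= 1) "N" else "Y"
is_new_first_row
```

Because the delta of roughly 0.33 is within the threshold, the first row is flagged "N" just like the second, so both rows land in the same (ID_NUMBER, IsNew) group and the later date decides which one is primary.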