Select the first row of each run, by group

Time: 2021-05-22 13:16:28

Tags: r duplicates sequence run-length-encoding

I have data with a grouping variable (ID) and some values (type):

ID <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")

dat <- data.frame(ID,type)

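Within each ID, I want to remove repeated values: not every duplicate, only a value that is the same as the one immediately before it. I have annotated some examples: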

#     ID type
#  1   1    1
#  2   1    3 # first value in a run of 3s within ID 1: keep 
#  3   1    3 # 2nd value: remove  
#  4   1    2
#  5   2    3
#  6   2    3
#  7   2    1
#  8   2    1
#  9   3    1
# 10   3    2 # first value in a run of 2s within ID 3: keep
# 11   3    2 # 2nd value: remove
# 12   3    1

For example, ID 3 has the value sequence 1, 2, 2, 1. The third value is the same as the second, so it should be removed, giving 1, 2, 1.
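As a quick check of the rule on a single sequence, base R's rle() collapses each run of equal adjacent values into one entry. This is only a minimal sketch on the ID 3 values; the vector name x is made up for illustration.

```r
# Values of type for ID 3, in their original order
x <- c("1", "2", "2", "1")

# rle() groups adjacent equal values; $values holds one entry per run
runs <- rle(x)
runs$values   # "1" "2" "1"
runs$lengths  # 1 2 1
```

The trailing 1 survives because it starts a new run, even though the value 1 already appeared earlier in the sequence.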

So the desired output is:

data.frame(ID = c("1", "1", "1", "2", "2", "3", "3", "3"),
           type = c("1", "3", "2", "3", "1", "1", "2", "1"))

  ID type
1  1    1
2  1    3
3  1    2
4  2    3
5  2    1
6  3    1
7  3    2
8  3    1

I tried

 dat[!duplicated(dat), ]

However, what I get is

ID <- c("1", "1", "1", "2", "2", "3", "3")
type<- c("1", "3", "2", "3", "1", "1", "2")

I know that duplicated() keeps only the unique rows. How can I get the values I want?
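A short sketch of why the last row goes missing, assuming the dat defined at the top of the question: duplicated() flags a row when an identical row appears anywhere earlier in the data frame, not only directly above it. The name dup is introduced here for illustration.

```r
ID <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat <- data.frame(ID, type)

# TRUE marks rows that repeat any earlier (ID, type) pair
dup <- duplicated(dat)
which(dup)  # 3 6 8 11 12

# Row 12 (ID 3, type 1) repeats row 9, although rows 10 and 11 sit between them
nrow(dat[!dup, ])  # 7
```

Rows 3, 6, 8 and 11 are the consecutive repeats that should go; row 12 is the one that is dropped by mistake.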

Thanks in advance for your help!

2 answers:

Answer 0: (score: 1)

Using data.table's rleid together with duplicated:

library(data.table)
setDT(dat)[!duplicated(rleid(ID, type))]

#   ID type
#1:  1    1
#2:  1    3
#3:  1    2
#4:  2    3
#5:  2    1
#6:  3    1
#7:  3    2
#8:  3    1

Improved the answer to include a suggestion from @Henrik.
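To make the idea behind rleid(ID, type) easier to follow, here is a base R sketch that builds a comparable run id by hand. It is an imitation for illustration, not data.table's implementation, and the names changed, run_id and res are hypothetical.

```r
ID <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat <- data.frame(ID, type)

n <- nrow(dat)

# TRUE where ID or type differs from the previous row (the first row always starts a run)
changed <- c(TRUE, ID[-1] != ID[-n] | type[-1] != type[-n])

# Counting the run starts gives every row the number of the run it belongs to
run_id <- cumsum(changed)
run_id  # 1 2 2 3 4 4 5 5 6 7 7 8

# Keeping the first row of each run id mirrors !duplicated(rleid(ID, type))
res <- dat[!duplicated(run_id), ]
nrow(res)  # 8
```

Because the run id increases whenever either column changes, a value that reappears later in the same ID gets a new id and is kept.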

Answer 1: (score: 1)

A base R way, if you only want to eliminate consecutive duplicate rows (8-row output):

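# Sketch, assuming the dat from the question: encode each (ID, type) pair as one key
runs <- rle(paste(dat$ID, dat$type))
# Index of the first row of every run of identical adjacent keys
first <- cumsum(c(1, head(runs$lengths, -1)))
dat[first, ]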

Created on 2021-05-22 by the reprex package (v2.0.0)