How to delete 3 consecutive rows containing the same value in a data.table

Time: 2017-11-07 15:36:57

Tags: r data.table

I have a data.table in R with 3 columns:

library(data.table)

DT_A <- data.table(
  sid = c(1,1,2,2,2,3,3,2,3,3),
  date = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23","2014-06-24",
                   "2014-06-22","2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
  Status1 = c("A","B","A","A","B","A","A","A","B","B"))

The data looks like this:

    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       A
 9:   3 2014-06-25       B
10:   3 2014-06-26       B

How can I look at Status1, check whether there are 3 consecutive rows with the value A (such as rows 6, 7, 8), and then delete those rows?

3 answers:

Answer 0 (score: 2)

The question is tagged data.table, so I'll try to give an appropriate answer:

DT_A[!DT_A[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
     sid       date Status1
1:     1 2014-06-22       A
2:     1 2014-06-23       B
3:     2 2014-06-22       A
4:     2 2014-06-23       A
5:     2 2014-06-24       B
6:     3 2014-06-25       B
7:     3 2014-06-26       B
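
To see why the one-liner above picks rows 6 to 8, it may help to split it into steps. The following is only a sketch; the names runs, drop_idx and result are introduced here for illustration and are not part of the original answer.

```r
library(data.table)

DT_A <- data.table(
  sid = c(1,1,2,2,2,3,3,2,3,3),
  date = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23","2014-06-24",
                   "2014-06-22","2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
  Status1 = c("A","B","A","A","B","A","A","A","B","B"))

# One summary row per run: rleid() gives each stretch of identical
# consecutive Status1 values its own id
runs <- DT_A[, .(value = Status1[1], n_rows = .N), by = .(run = rleid(Status1))]

# Row numbers (.I) of the runs that are exactly 3 rows long and consist of "A"
drop_idx <- DT_A[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1

# Negating the index keeps every other row
result <- DT_A[!drop_idx]
```

Printing runs shows six runs, of which only the fifth has value A and 3 rows, so drop_idx contains the row numbers 6, 7 and 8.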

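Additional test cases

As Frank pointed out, my first answer (now edited) only worked for the sample dataset given by the OP and failed on other test cases.

Therefore, the edited code is applied to some additional test cases below.

Case B: 3 consecutive rows for each of the letters A and B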

DT_B <- data.table(
  sid=c(1,1,2,2,2,3,3,2,3,3,3), 
  date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
                 "2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")), 
  Status1 = c("A","B","A","A","B","A","A","A","B","B","B"))
DT_B
    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       A
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       B
DT_B[!DT_B[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
   sid       date Status1
1:   1 2014-06-22       A
2:   1 2014-06-23       B
3:   2 2014-06-22       A
4:   2 2014-06-23       A
5:   2 2014-06-24       B
6:   3 2014-06-25       B
7:   3 2014-06-26       B
8:   3 2014-06-26       B

Only the 3 consecutive rows containing the letter A (rows 6 to 8) are removed.

Case C: nothing to remove

DT_C <- data.table(
  sid=c(1,1,2,2,2,3,3,2,3,3,3), 
  date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
                 "2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")), 
  Status1 = c("A","B","A","A","B","A","A","C","B","B","C"))
DT_C
    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       C
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       C
DT_C[!DT_C[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       C
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       C

No rows are removed because there is no run of 3 consecutive rows containing A.

Case D: edge case, all rows are removed

DT_D <- DT_A[6:8]
DT_D
   sid       date Status1
1:   3 2014-06-22       A
2:   3 2014-06-23       A
3:   2 2014-06-24       A
DT_D[!DT_D[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
Empty data.table (0 rows) of 3 cols: sid,date,Status1

All rows are removed and an empty data.table is returned, because the input data.table consists of only 3 rows with the letter A.
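
Since the test cases repeat the same expression on different tables, one option is to wrap the logic in a small function. The sketch below is one possible way to do that; drop_runs and its arguments col, value and n are hypothetical names, not part of data.table, and the code assumes the column has no missing values.

```r
library(data.table)

# Hypothetical helper: drop every run of exactly `n` consecutive rows
# whose column `col` equals `value`. Assumes `col` contains no NA.
drop_runs <- function(DT, col, value, n = 3L) {
  x <- DT[[col]]
  run <- rleid(x)                  # run id for each row (1, 2, 3, ...)
  run_len <- tabulate(run)[run]    # length of the run each row belongs to
  keep <- !(x == value & run_len == n)
  DT[keep]                         # a single-symbol i is looked up in the calling scope
}

DT_A <- data.table(
  sid = c(1,1,2,2,2,3,3,2,3,3),
  date = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23","2014-06-24",
                   "2014-06-22","2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
  Status1 = c("A","B","A","A","B","A","A","A","B","B"))

res_A <- drop_runs(DT_A, "Status1", "A")        # drops rows 6 to 8
res_D <- drop_runs(DT_A[6:8], "Status1", "A")   # edge case: nothing is left
res_B2 <- drop_runs(DT_A, "Status1", "B", 2L)   # drops the run of two B rows at the end
```

Replacing run_len == n with run_len >= n would drop runs of at least n rows instead; the question does not say which of the two is intended.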

Answer 1 (score: 1)

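with(rle(DT_A$Status1 == "A"), {
    ends <- cumsum(lengths)          # last row number of each run
    starts <- ends - lengths + 1L    # first row number of each run
    # only runs of TRUE (Status1 == "A") with at least 3 rows; also safe when the first run matches
    unlist(lapply(which(values & lengths >= 3), function(i) starts[i]:ends[i]))
})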
#[1] 6 7 8
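
The code above only returns the row numbers. A minimal sketch of how they might then be used to drop the rows follows; the variable names r, hit, idx and keep are chosen here for illustration. Note that it matches runs of at least 3 rows, whereas answer 0 matches runs of exactly 3.

```r
library(data.table)

DT_A <- data.table(
  sid = c(1,1,2,2,2,3,3,2,3,3),
  date = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23","2014-06-24",
                   "2014-06-22","2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
  Status1 = c("A","B","A","A","B","A","A","A","B","B"))

r <- rle(DT_A$Status1 == "A")
ends <- cumsum(r$lengths)               # last row number of each run
starts <- ends - r$lengths + 1L         # first row number of each run
hit <- which(r$values & r$lengths >= 3) # runs of "A" with at least 3 rows

# Expand each matching run into its row numbers
idx <- unlist(Map(seq, starts[hit], ends[hit]))

# setdiff() also behaves when idx is empty, in which case every row is kept
keep <- setdiff(seq_len(nrow(DT_A)), idx)
result <- DT_A[keep]
```

Building the rows to keep with setdiff() sidesteps the question of how negative indexing treats an empty index vector.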

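Answer 2 (score: 1)

I assume you made a mistake in your sid definition and that your 3 rows should all have sid = 3. If not, sorry, my answer won't work. If that is the case, the solution can be a one-liner: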

 DT_A[,.SD[.N < 3 | Status1 != "A",], by = .(sid,Status1)]

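This is a simple one-liner that does what you need: grouping by sid and Status1, it keeps the rows whose group has fewer than 3 rows or whose Status1 is not A (the negation of the selection you want to delete: at least 3 A's). Hope it helps.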
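
The sketch below illustrates the assumption this answer relies on. DT_fix is a hypothetical version of the data in which row 8 has sid = 3 instead of 2; it is not the data posted in the question. Because the grouping is by sid and Status1 rather than by consecutive runs, the result depends on that correction.

```r
library(data.table)

# Hypothetical corrected data: row 8 now has sid = 3 (assumption made by this answer)
DT_fix <- data.table(
  sid = c(1,1,2,2,2,3,3,3,3,3),
  date = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23","2014-06-24",
                   "2014-06-22","2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
  Status1 = c("A","B","A","A","B","A","A","A","B","B"))

# Number of rows in each (sid, Status1) group
sizes <- DT_fix[, .N, by = .(sid, Status1)]

# Keep a group unless it has 3 or more rows and its Status1 is "A"
res <- DT_fix[, .SD[.N < 3 | Status1 != "A"], by = .(sid, Status1)]
```

With the corrected data, the group sid = 3, Status1 = A has 3 rows and is dropped. On the data as posted, the group with 3 rows is sid = 2, Status1 = A (rows 3, 4 and 8), so different rows would be removed.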