Select sub-data-frames between two values

Time: 2018-03-09 14:12:37

Tags: r dataframe

I have a data frame in which the first column is mass and the second column is abundance.

dd <- read.table(text = "771.55 0
                  772.35 0 
                 772.9 10 
                 773.81 0
                 885.64 0 
                 885.65 110 
                 885.68 313 
                 885.70 313 
                 885.78 71 
                 885.82 0
                 889.12 0
                 889.13 506 
                 885.82 0
                 900.31 0 
                 900.34 10 
                 901.22 1901 
                 902.8 0
                 908.8 0")

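I need to select only the sub-data-frames that are delimited by zeros in the second column (each one starts with a 0 and ends with a 0) and that contain at least one abundance (second-column value) > 100. The result must be: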

list1 <- read.table(text= "885.64 0
                          885.65 110 
                          885.68 313 
                          885.70 313 
                          885.78 71 
                          885.82 0")

list3 <- read.table(text= "889.12 0
                          889.13 506 
                          885.82 0")

... and so on.

Someone proposed this solution:

dd[!!ave(dd$V2, c(0, cumsum(diff(dd$V2) == 0)), FUN = function(x) any(x > 100)), ]

It works well, but it also triggers a split wherever an abundance value is repeated. Instead of extracting:

list <- read.table(text= "885.64 0 
             885.65 110 
             885.68 313 
             885.70 313 
             885.78 71 
             885.82 0")

it incorrectly cuts in the middle of the series:

list <- read.table(text= "885.64 0 
                 885.65 110 
                 885.68 313")

list <- read.table(text= "885.70 313 
                 885.78 71 
                 885.82 0")
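The reason for the wrong cut can be seen by looking at the grouping variable on its own. The following is a minimal sketch using the data from the question; the name `grp` is introduced here only for illustration. Because `diff(dd$V2) == 0` is true for any pair of equal neighbours, not just for pairs of zeros, the counter also advances between the two 313 values.

```r
# Data from the question: V1 = mass, V2 = abundance
dd <- read.table(text = "771.55 0
772.35 0
772.9 10
773.81 0
885.64 0
885.65 110
885.68 313
885.70 313
885.78 71
885.82 0
889.12 0
889.13 506
885.82 0
900.31 0
900.34 10
901.22 1901
902.8 0
908.8 0")

# Grouping variable used by the proposed one-liner (`grp` is an illustrative name):
# the counter increases whenever two neighbouring abundances are equal,
# whether they are both 0 or both 313
grp <- c(0, cumsum(diff(dd$V2) == 0))

# Show the group id next to each row; rows 7 and 8 (the two 313 values)
# end up in different groups, which is where the series gets cut
cbind(dd, grp)
```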

3 Answers:

Answer 0: (score: 1)

Here is a solution that uses data.table to build a grouping variable:

library("data.table")
dt <- fread(
  "x y
771.55 0
772.35 0 
772.9 10 
773.81 0
885.64 0 
885.65 10 
885.68 313 
885.70 313 
885.78 71 
885.82 0
889.12 0
889.13 506 
885.82 0
900.31 0 
900.34 10 
901.22 1901 
902.8 0
908.8 0")
dt[, ':='(y2=shift(y), y3=shift(y, type="lead"))]
dt[, ':='(start=(y==0 & y3>0), stop=(y==0 & y2>0))]
dt[, group:=(rleid(start, stop)+1)%/%3]
dt[, if (.N>=3 && max(y)>100) .SD[, .(x, y)], group]
# > dt[, if (.N>=3 && max(y)>100) .SD[, .(x, y)], group]
#    group      x    y
# 1:     2 885.64    0
# 2:     2 885.65   10
# 3:     2 885.68  313
# 4:     2 885.70  313
# 5:     2 885.78   71
# 6:     2 885.82    0
# 7:     3 889.12    0
# 8:     3 889.13  506
# 9:     3 885.82    0
# 10:    4 900.31    0
# 11:    4 900.34   10
# 12:    4 901.22 1901
# 13:    4 902.80    0

Here is a shorter variant:

dt[, group:=rleidv(y==0 & shift(y)==0) %/%2][, if (.N>2 && max(y)>100) .SD, group]

Answer 1: (score: 1)

 library(data.table) 
 A=setDT(dd)[,group:=cumsum(c(diff(as.numeric(!V2)),0)<0)][,
              b:=any(V2>100),by=group][!!b][,b:=NULL]

 split(A,A$group)
$`2`
       V1  V2 group
1: 885.64   0     2
2: 885.65 110     2
3: 885.68 313     2
4: 885.70 313     2
5: 885.78  71     2
6: 885.82   0     2

$`3`
       V1  V2 group
1: 889.12   0     3
2: 889.13 506     3
3: 885.82   0     3

$`4`
       V1   V2 group
1: 900.31    0     4
2: 900.34   10     4
3: 901.22 1901     4
4: 902.80    0     4
5: 908.80    0     4

Answer 2: (score: 1)

Here is another data.table solution, which uses a non-equi join and groups within the join:

library(data.table)
# coerce to data.table and append row numbers
setDT(dd)[, rn := .I]
# find start and end indices of subsequences from zero to zero
mdt <- dd[, {tmp = .I[V2 == 0]; .(beg = head(tmp, -1L), end = tail(tmp, -1L))}]
# non-equi join of index ranges and group within the join 
# to return only subsequences which fulfill the condition
result <- dd[mdt, on = .(rn >= beg, rn <= end), .SD[any(V2 > 100)], by = .EACHI][
  # return mass, abundance, and group id
  , .(V1, V2, rleid(rn))]

result
        V1   V2 V3
 1: 885.64    0  1
 2: 885.65  110  1
 3: 885.68  313  1
 4: 885.70  313  1
 5: 885.78   71  1
 6: 885.82    0  1
 7: 889.12    0  2
 8: 889.13  506  2
 9: 885.82    0  2
10: 900.31    0  3
11: 900.34   10  3
12: 901.22 1901  3
13: 902.80    0  3

The grouping variable V3 should be sufficient for further grouped processing. However, if separate sub-data-tables are needed:

split(result, by = "V3")
$`1`
       V1  V2 V3
1: 885.64   0  1
2: 885.65 110  1
3: 885.68 313  1
4: 885.70 313  1
5: 885.78  71  1
6: 885.82   0  1

$`2`
       V1  V2 V3
1: 889.12   0  2
2: 889.13 506  2
3: 885.82   0  2

$`3`
       V1   V2 V3
1: 900.31    0  3
2: 900.34   10  3
3: 901.22 1901  3
4: 902.80    0  3
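For readers who do not use data.table, the same zero-to-zero idea can be sketched in base R. This is not part of the original answer, only an illustration under the assumption that every pair of neighbouring zero rows delimits one candidate segment; the names `zero_idx`, `beg`, `end`, `segments` and `keep` are made up for this example.

```r
# Data from the question: V1 = mass, V2 = abundance
dd <- read.table(text = "771.55 0
772.35 0
772.9 10
773.81 0
885.64 0
885.65 110
885.68 313
885.70 313
885.78 71
885.82 0
889.12 0
889.13 506
885.82 0
900.31 0
900.34 10
901.22 1901
902.8 0
908.8 0")

# Row numbers where the abundance is zero
zero_idx <- which(dd$V2 == 0)

# Each zero row paired with the next zero row gives one candidate segment
beg <- head(zero_idx, -1L)
end <- tail(zero_idx, -1L)

# Cut the data frame into the candidate segments (zero to zero, inclusive)
segments <- Map(function(b, e) dd[b:e, ], beg, end)

# Keep only segments that contain at least one abundance above 100
keep <- vapply(segments, function(s) any(s$V2 > 100), logical(1))
result <- segments[keep]

result
```

Because the segments are defined by the positions of the zeros rather than by `diff()`, repeated non-zero values such as the two 313 rows do not split a segment.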