I have a data frame in which the first column is mass and the second column is abundance.
dd <- read.table(text = "771.55 0
772.35 0
772.9 10
773.81 0
885.64 0
885.65 110
885.68 313
885.70 313
885.78 71
885.82 0
889.12 0
889.13 506
885.82 0
900.31 0
900.34 10
901.22 1901
902.8 0
908.8 0")
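I need to select only those sub-data-frames that are bounded by zeros in the second column (starting with a 0 and ending with a 0) and that contain an abundance (second-column value) > 100. The result must be: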
list1 <- read.table(text= "885.64 0
885.65 110
885.68 313
885.70 313
885.78 71
885.82 0")
list3 <- read.table(text= "889.12 0
889.13 506
885.82 0")
...etc.
Someone proposed this solution:
dd[!!ave(dd$V2, c(0, cumsum(diff(dd$V2) == 0)), FUN = function(x) any(x > 100)), ]
It works well, but it also triggers when there are repeated abundance values. Instead of cutting out:
list <- read.table(text= "885.64 0
885.65 110
885.68 313
885.70 313
885.78 71
885.82 0")
it incorrectly cuts in the middle of the series:
list <- read.table(text= "885.64 0
885.65 110
885.68 313")
list <- read.table(text= "885.70 313
885.78 71
885.82 0")
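The mis-split can be reproduced on the abundance values of the 885.x peak alone. This is only a minimal sketch; the names `v2` and `key` are illustrative and do not appear in the original code. It shows that the grouping key changes wherever two neighbouring values are equal, not only where they are both zero:

```r
# Abundance values of the 885.x peak from the question
v2 <- c(0, 110, 313, 313, 71, 0)

# Grouping key used by the proposed solution:
# the counter increases whenever two neighbouring values are equal
key <- c(0, cumsum(diff(v2) == 0))

# The repeated 313 starts a new group in the middle of the peak
print(data.frame(v2, key))
```

The first three rows end up in one group and the last three in another, which is exactly the unwanted cut shown above.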
Answer 0 (score: 1)
Here is a solution that uses data.table to build a grouping variable:
library("data.table")
dt <- fread(
"x y
771.55 0
772.35 0
772.9 10
773.81 0
885.64 0
885.65 10
885.68 313
885.70 313
885.78 71
885.82 0
889.12 0
889.13 506
885.82 0
900.31 0
900.34 10
901.22 1901
902.8 0
908.8 0")
dt[, ':='(y2=shift(y), y3=shift(y, type="lead"))]
dt[, ':='(start=(y==0 & y3>0), stop=(y==0 & y2>0))]
dt[, group:=(rleid(start, stop)+1)%/%3]
dt[, if (.N>=3 && max(y)>100) .SD[, .(x, y)], group]
# > dt[, if (.N>=3 && max(y)>100) .SD[, .(x, y)], group]
# group x y
# 1: 2 885.64 0
# 2: 2 885.65 10
# 3: 2 885.68 313
# 4: 2 885.70 313
# 5: 2 885.78 71
# 6: 2 885.82 0
# 7: 3 889.12 0
# 8: 3 889.13 506
# 9: 3 885.82 0
# 10: 4 900.31 0
# 11: 4 900.34 10
# 12: 4 901.22 1901
# 13: 4 902.80 0
Here is a shorter variant:
dt[, group:=rleidv(y==0 & shift(y)==0) %/%2][, if (.N>2 && max(y)>100) .SD, group]
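To see what the `start` and `stop` flags in the longer version mark, here is a small sketch on toy data. The table `toy` and the column names `prev` and `nxt` are made up for illustration; only `shift()` and `:=` are taken from the answer:

```r
library(data.table)

# Toy abundance column: leading zeros, one peak, trailing zeros
toy <- data.table(y = c(0, 0, 10, 313, 0, 0))

# Previous and next value of y (NA at the edges)
toy[, `:=`(prev = shift(y), nxt = shift(y, type = "lead"))]

# start: a zero directly before a non-zero value
# stop:  a zero directly after a non-zero value
toy[, `:=`(start = (y == 0 & nxt > 0), stop = (y == 0 & prev > 0))]

print(toy)
which(toy$start)  # row of the zero that opens the peak
which(toy$stop)   # row of the zero that closes the peak
```

Note that the flags are `NA` in the first and last rows because `shift()` has nothing to compare against there; `which()` ignores those entries.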
Answer 1 (score: 1)
library(data.table)
A=setDT(dd)[,group:=cumsum(c(diff(as.numeric(!V2)),0)<0)][,
b:=any(V2>100),by=group][!!b][,b:=NULL]
split(A,A$group)
$`2`
V1 V2 group
1: 885.64 0 2
2: 885.65 110 2
3: 885.68 313 2
4: 885.70 313 2
5: 885.78 71 2
6: 885.82 0 2
$`3`
V1 V2 group
1: 889.12 0 3
2: 889.13 506 3
3: 885.82 0 3
$`4`
V1 V2 group
1: 900.31 0 4
2: 900.34 10 4
3: 901.22 1901 4
4: 902.80 0 4
5: 908.80 0 4
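The grouping expression in this answer is dense, so here is a step-by-step sketch of it in base R on a toy vector. The names `v2`, `is_zero`, `drop` and `grp` are illustrative and not part of the answer:

```r
# Toy abundance vector: two peaks separated by zeros, plus a trailing zero
v2 <- c(0, 0, 10, 0, 0, 110, 313, 0, 0)

# 1 where the abundance is zero, 0 otherwise
is_zero <- as.numeric(!v2)

# TRUE at a zero that is directly followed by a non-zero value
# (the indicator drops from 1 to 0); the padded 0 keeps the length
drop <- c(diff(is_zero), 0) < 0

# A new group starts at each such zero
grp <- cumsum(drop)

print(data.frame(v2, grp))
```

Under this scheme, extra zeros after a peak stay attached to the preceding group, which appears to be why the row `908.80 0` is part of group 4 in the output above.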
Answer 2 (score: 1)
Here is another data.table solution, which uses a non-equi join and grouping within the join:
library(data.table)
# coerce to data.table and append row numbers
setDT(dd)[, rn := .I]
# find start and end indices of subsequences from zero to zero
mdt <- dd[, {tmp = .I[V2 == 0]; .(beg = head(tmp, -1L), end = tail(tmp, -1L))}]
# non-equi join of index ranges and group within the join
# to return only subsequences which fulfill the condition
result <- dd[mdt, on = .(rn >= beg, rn <= end), .SD[any(V2 > 100)], by = .EACHI][
# return mass, abundance, and group id
, .(V1, V2, rleid(rn))]
result
        V1   V2 V3
 1: 885.64    0  1
 2: 885.65  110  1
 3: 885.68  313  1
 4: 885.70  313  1
 5: 885.78   71  1
 6: 885.82    0  1
 7: 889.12    0  2
 8: 889.13  506  2
 9: 885.82    0  2
10: 900.31    0  3
11: 900.34   10  3
12: 901.22 1901  3
13: 902.80    0  3
The grouping variable V3 should be sufficient for further group-wise processing. However, if separate sub-data.tables are needed:
split(result, by = "V3")
$`1`
       V1  V2 V3
1: 885.64   0  1
2: 885.65 110  1
3: 885.68 313  1
4: 885.70 313  1
5: 885.78  71  1
6: 885.82   0  1

$`2`
       V1  V2 V3
1: 889.12   0  2
2: 889.13 506  2
3: 885.82   0  2

$`3`
       V1   V2 V3
1: 900.31    0  3
2: 900.34   10  3
3: 901.22 1901  3
4: 902.80    0  3
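The idea behind `mdt` in this answer (pairing each zero with the next zero and keeping only the ranges that contain a value above 100) can be sketched in base R without a join. This is an illustration on a toy vector, not the answer's method; `v2`, `zero_idx` and `keep` are made-up names:

```r
# Toy abundance vector with one qualifying peak
v2 <- c(0, 0, 10, 0, 0, 110, 313, 0)

# Row indices of all zeros
zero_idx <- which(v2 == 0)

# Pair every zero with the following zero: candidate ranges [beg, end]
beg <- head(zero_idx, -1L)
end <- tail(zero_idx, -1L)

# Keep only ranges that contain an abundance above 100
keep <- mapply(function(b, e) any(v2[b:e] > 100), beg, end)

print(data.frame(beg, end, keep))
v2[beg[keep]:end[keep]]  # the single qualifying sub-sequence
```

Ranges made of two adjacent zeros, or of a peak that never exceeds 100, are dropped by the `keep` filter, which plays the role of `.SD[any(V2 > 100)]` in the join.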