Question

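Two related questions:

Edit: PS. I'm looking for a data.table-based solution.

1. How do I select the rows of a data.table in which the values of all columns are above a given threshold?

2. How do I select the columns of a data.table whose values are all above a given threshold?

Reproducible example: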

library(data.table)
dt <- data.table(V1=1:5, V2=3:7, V3=7:3)

Conditionally selecting all rows

# this line selects rows based on column `V1`. 
  dt[ V1 > 2, ] 

# I'm looking for a way to select rows based on values of all columns. My failed attempt
  dt[ names(dt) > 2, ] 

# *expected output*: a data.table with all columns but only with those rows where all values are `> 2` 

#> V1 V2 V3
#> 3  5  5
#> 4  6  4
#> 5  7  3

Conditionally selecting all columns

# My failed attempt
  dt[, .SD, .SDcols > 2 ]

# *expected output*: a data.table with all rows but only with those columns where all values are `> 2`

#>   V2 V3
#>   3  7
#>   4  6
#>   5  5
#>   6  4
#>   7  3

Answer 1

To get all columns but only the rows where every value is above the threshold, the best approach is classic filtering:

dt[rowMeans(dt>threshold)==1,]

To get all rows but only the columns where every value is above the threshold, you can do the following:

dt[, colMeans(dt > threshold) == 1, with = FALSE]
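
As a concrete check, here is a minimal sketch of both one-liners run against the question's table. The answer leaves `threshold` undefined, so the value 2 below is an assumption taken from the question's expected output; `rows_kept` and `cols_kept` are names introduced only for illustration.

```r
library(data.table)

dt <- data.table(V1 = 1:5, V2 = 3:7, V3 = 7:3)
threshold <- 2  # assumed value, taken from the question's example

# dt > threshold is a logical matrix; a row (or column) mean of 1
# means every entry in that row (or column) is TRUE.
rows_kept <- dt[rowMeans(dt > threshold) == 1, ]
cols_kept <- dt[, colMeans(dt > threshold) == 1, with = FALSE]

print(rows_kept)  # rows (3, 5, 5), (4, 6, 4), (5, 7, 3)
print(cols_kept)  # columns V2 and V3, all five rows
```

Comparing the mean to a value below 1 (for example `>= 0.5`) would loosen the rule from "all values" to "at least half of the values", which is one reason this form can be handy.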

Answer 2

For subsetting rows, the following code uses base R. Because you are looking across rows, you treat the data table as a matrix.

rws <- apply(dt, 1L, function(r) any(r > 4))
dt[rws]

For columns, you can again use the list-like nature of a data table:

cls <- sapply(dt, function(c) any(c > 4))
dt[, cls, with = FALSE]
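
Note that `any()` keeps a row or column as soon as one value exceeds the threshold (4 in the snippets above). The question asks for rows and columns where every value exceeds 2, so the sketch below swaps in `all()` and assumes `threshold <- 2`; `rows_all` and `cols_all` are illustrative names, not part of the answer.

```r
library(data.table)

dt <- data.table(V1 = 1:5, V2 = 3:7, V3 = 7:3)
threshold <- 2  # assumed value, taken from the question's example

# Rows: apply() coerces the table to a matrix and visits each row.
rws <- apply(dt, 1L, function(r) all(r > threshold))
rows_all <- dt[rws]

# Columns: a data.table is a list of columns, so sapply() visits each one.
cls <- sapply(dt, function(col) all(col > threshold))
cols_all <- dt[, names(cls)[cls], with = FALSE]

print(rows_all)  # rows (3, 5, 5), (4, 6, 4), (5, 7, 3)
print(cols_all)  # columns V2 and V3
```

Selecting columns by name rather than by the logical vector itself is a stylistic choice here; it keeps the selection explicit if the vector is later reordered.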

Answer 3

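Optionally, here is a solution that is more complex than rowMeans but offers greater flexibility. It uses the lhs.all helper function, which repeats the supplied expression once for every field named on the expression's left-hand side (LHS) and joins the copies with `&`.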

library(data.table)
dt = data.table(V1=1:5, V2=3:7, V3=7:3)

lhs.all = function(pseudo.expr) {
    sub.pseudo.expr = substitute(pseudo.expr)
    stopifnot(is.call(sub.pseudo.expr), is.character(cols <- eval.parent(sub.pseudo.expr[[2L]])))
    l.expr = lapply(cols, function(x) {
        sub.expr=sub.pseudo.expr
        sub.expr[[2L]] = as.name(x)
        sub.expr
    })
    Reduce(function(a, b) bquote(.(a) & .(b)), l.expr)
}
lhs.all(names(dt) > 2)
#V1 > 2 & V2 > 2 & V3 > 2
dt[eval(lhs.all(names(dt) > 2))]
#   V1 V2 V3
#1:  3  5  5
#2:  4  6  4
#3:  5  7  3
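
The flexibility mentioned above can be illustrated with a short sketch that applies a different operator to only some of the columns. The helper is repeated from the answer so the block runs on its own; `key_cols` and `row_filter` are hypothetical names, and building the expression before calling `dt[...]` is just one way to keep the steps readable.

```r
library(data.table)

dt <- data.table(V1 = 1:5, V2 = 3:7, V3 = 7:3)

# Helper repeated from the answer above so this block is self-contained.
lhs.all <- function(pseudo.expr) {
    sub.pseudo.expr <- substitute(pseudo.expr)
    stopifnot(is.call(sub.pseudo.expr),
              is.character(cols <- eval.parent(sub.pseudo.expr[[2L]])))
    l.expr <- lapply(cols, function(x) {
        sub.expr <- sub.pseudo.expr
        sub.expr[[2L]] <- as.name(x)  # swap the LHS for one column name
        sub.expr
    })
    Reduce(function(a, b) bquote(.(a) & .(b)), l.expr)
}

key_cols <- c("V1", "V2")             # hypothetical subset of columns
row_filter <- lhs.all(key_cols >= 4)  # builds: V1 >= 4 & V2 >= 4
res <- dt[eval(row_filter)]

print(row_filter)
print(res)  # rows (4, 6, 4) and (5, 7, 3)
```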
