Question

I have a large data frame (104029 x 142).

I want to keep the rows where the value is > 0 in any of several specific, named columns.

df
         word abrasive abrasives abrasivefree abrasion slurry solute solution ....
1 composition     -0.2       0.2         -0.3    -0.40    0.2      0.1         0.20 ....
2       ceria      0.1       0.2         -0.4    -0.20   -0.1     -0.2         0.20 ....
3     diamond      0.3      -0.5         -0.6    -0.10   -0.1     -0.2        -0.15 ....
4        acid     -0.1      -0.1         -0.2    -0.15    0.1      0.3         0.20 ....
....

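Right now I do this with the filter() function, and it works.

But I don't think this approach is efficient for me.

Because I have to spell out every column name by hand, it will be hard to maintain.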

column_names <- c("agent", "agents", "liquid", "liquids", "slurry", 
                  "solute", "solutes", "solution", "solutions")

df_filter <- filter(df,  agent>0 | agents>0 | liquid>0 | liquids>0 | slurry>0 | solute>0 |
                    solutes>0 | solution>0 | solutions>0)

df_filter
         word abrasive abrasives abrasivefree abrasion  slurry solute solution ....
1 composition     -0.2       0.2         -0.3    -0.40    0.2      0.1         0.20 ....
2       ceria      0.1       0.2         -0.4    -0.20   -0.1     -0.2         0.20 ....
4        acid     -0.1      -0.1         -0.2    -0.15    0.1      0.3         0.20 ....
....

Is there a more efficient way?

Answer 1

For the condition you are testing, this line returns a vector of TRUE/FALSE values, one per row:

filter_condition <- apply(df[ , column_names], 1, function(x){sum(x>0)} )>0

Then you can use it to subset the rows:

df[filter_condition, ]

I'm sure there is something better in dplyr.
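As a variation on the same idea, the row-wise apply() can usually be replaced by a vectorized rowSums() over the logical matrix, which tends to be faster on a data frame of this size. The sketch below uses a small made-up data frame in place of the real one, with only three of the columns; drop = FALSE and na.rm = TRUE are defensive choices that are not in the original answer.

```r
# Toy data standing in for the real 104029 x 142 data frame (values are illustrative).
df <- data.frame(
  word     = c("composition", "ceria", "diamond", "acid"),
  slurry   = c(0.2, -0.1, -0.1, 0.1),
  solute   = c(0.1, -0.2, -0.2, 0.3),
  solution = c(0.20, 0.20, -0.15, 0.20)
)
column_names <- c("slurry", "solute", "solution")

# Comparing the selected columns with 0 gives a logical matrix;
# rowSums() counts the TRUEs per row, so "> 0" means "at least one column is positive".
# drop = FALSE keeps a data frame even if column_names has length 1.
keep <- rowSums(df[, column_names, drop = FALSE] > 0, na.rm = TRUE) > 0

df_filter <- df[keep, ]
```

Here the "diamond" row is dropped because all three of its selected values are negative.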

Answer 2

With dplyr::filter_at(), you can use select()-style helpers to pick the columns to test:

library(dplyr)

df_filter <- df %>%
    filter_at(
        # select all the columns that are in your column_names vector
        vars(one_of(column_names))
        # if any of those variables are greater than zero, keep the row
        , any_vars( . > 0)
    )
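In more recent dplyr releases the scoped verbs such as filter_at() are marked as superseded, and the same filter can be written with if_any() inside a plain filter(). The following is a minimal sketch assuming dplyr 1.0.4 or later; the toy data frame is made up for illustration. Note that all_of() raises an error if a name in column_names is missing from df, whereas any_of() would silently skip it.

```r
library(dplyr)

# Toy data standing in for the real data frame (values are illustrative).
df <- data.frame(
  word     = c("composition", "ceria", "diamond", "acid"),
  slurry   = c(0.2, -0.1, -0.1, 0.1),
  solute   = c(0.1, -0.2, -0.2, 0.3),
  solution = c(0.20, 0.20, -0.15, 0.20)
)
column_names <- c("slurry", "solute", "solution")

# Keep a row if any of the listed columns is greater than zero.
df_filter <- df %>%
  filter(if_any(all_of(column_names), ~ .x > 0))
```

Maintaining the filter then only means editing the column_names vector.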

Efficiently filtering rows by a pattern across multiple columns
