Question

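I have a data frame and a predictive model that I want to apply to the data. However, I want to filter out records to which the model may not be applicable. To do this, I have another data frame that contains, for each variable, the minimum and maximum values observed in the training data. I want to remove from my new data every record in which one or more values fall outside the specified range.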

To make the question concrete, my data might look like this:

  id   x       y     
 ---- ---- --------- 
   1    2     30521  
   2   -1      1835  
   3    5     25939  
   4    4   1000000

And this is my second table, which holds the minimum and maximum values:

  var   min    max   
 ----- ----- ------- 
  x       1       5  
  y       0   99999

In this example, I want to flag the following records in my data: 2 (below the minimum for x) and 4 (above the maximum for y).

How can I do this easily in R? I have a hunch that some clever dplyr code could accomplish this, but I don't know what it would look like.

Answer 1

With your data as follows:

df = data.frame(x=c(2,-1,5,4,7,8), y=c(30521, 1800, 25000,1000000, -5, 10))
limits = data.frame("var"=c("x", "y"), min=c(1,0), max=c(5,99999))

You can use the sweep function with the '>' and '<' operators, which is quite simple:

sweep(df, 2, limits[, 2], FUN='>') & sweep(df, 2, limits[, 3], FUN='<')
####          x     y
#### [1,]  TRUE  TRUE
#### [2,] FALSE  TRUE
#### [3,] FALSE  TRUE
#### [4,]  TRUE FALSE
#### [5,] FALSE FALSE
#### [6,] FALSE  TRUE

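The TRUE entries tell you, for each variable, which observations to keep. This works for any number of variables, as long as the rows of limits are in the same order as the columns of df. Note that '>' and '<' are strict, so a value equal to a limit (such as x = 5) counts as out of range; use '>=' and '<=' if the limits should be inclusive.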

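If you then need a single flag per row (a row fails as soon as one column is out of range), you can run this simple line, where res is the previous output. TRUE marks the rows that pass every check: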

apply(res, 1, all)
#### [1]  TRUE FALSE FALSE FALSE FALSE FALSE
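
As a rough sketch of how the two steps fit together, the snippet below repeats the check with inclusive bounds and then subsets df. Reading the limits as inclusive is an assumption based on the question, and the names idx, in_range, keep and df_clean are introduced here purely for illustration.

```r
df <- data.frame(x = c(2, -1, 5, 4, 7, 8),
                 y = c(30521, 1800, 25000, 1000000, -5, 10))
limits <- data.frame(var = c("x", "y"), min = c(1, 0), max = c(5, 99999))

# Line the limits up with the columns of df by name rather than by position
idx <- match(names(df), limits$var)

# Inclusive comparison: a value equal to min or max counts as in range
in_range <- sweep(df, 2, limits$min[idx], FUN = ">=") &
            sweep(df, 2, limits$max[idx], FUN = "<=")

# Keep only the rows in which every column is in range
keep <- apply(in_range, 1, all)
df_clean <- df[keep, ]
```

Matching by name means the order of the rows in limits no longer has to follow the column order of df.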

Answer 2

Not very elegant, but anyway:

df <- read.table(header=T, text="  id   x       y     
   1    2     30521  
   2   -1      1835  
   3    5     25939  
   4    4   1000000 ") 
df
ranges <- read.table(header=T, text="  var   min    max   
  x       1       5  
  y       0   99999")

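# Reorder the rows of ranges to follow the column order of df (id excluded)
ranges <- ranges[match(names(df)[-1], ranges[, 1]), ]
# Flag a row if any of its values falls outside [min, max]
df$flag <- matrixStats::rowAnys(
  !sapply(seq_along(df)[-1], function(j) {
    df[, j] >= ranges[j - 1, 2] & df[, j] <= ranges[j - 1, 3]
  })
)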
df$flag
# [1] FALSE  TRUE FALSE  TRUE
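
If you would rather avoid the matrixStats dependency, the same flag can probably be built with base R alone. The following is only a sketch under that assumption; the name outside is hypothetical, and the variables are looked up by name, so no reordering of ranges is needed.

```r
df <- read.table(header = TRUE, text = "id x y
1  2 30521
2 -1 1835
3  5 25939
4  4 1000000")
ranges <- read.table(header = TRUE, text = "var min max
x 1 5
y 0 99999")

# For each variable named in ranges: TRUE where the value is outside [min, max]
outside <- Map(function(v, lo, hi) df[[v]] < lo | df[[v]] > hi,
               as.character(ranges$var), ranges$min, ranges$max)

# Combine the per-variable vectors with OR: a row is flagged if any variable is out
df$flag <- Reduce(`|`, outside)

# Rows that are inside every range
df[!df$flag, ]
```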

Answer 3

Similarly, with dplyr:

library(dplyr)
df <- read.table(text = "  id   x       y     
           1    2     30521  
           2   -1      1835  
           3    5     25939  
           4    4   1000000  ", header = TRUE)


dfilte <- read.table(text = "  var   min    max
  x       1       5  
  y       0   99999  ", header = TRUE)


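# between() is inclusive on both ends; the flag is TRUE when the value is out of range
df %>% mutate(flag_x = !between(x, dfilte$min[1], dfilte$max[1]),
              flag_y = !between(y, dfilte$min[2], dfilte$max[2]))

which produces this output:

  id  x       y flag_x flag_y
1  1  2   30521  FALSE  FALSE
2  2 -1    1835   TRUE  FALSE
3  3  5   25939  FALSE  FALSE
4  4  4 1000000  FALSE   TRUE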
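
For more than a couple of variables, writing one flag per column by hand gets tedious. One possible generalisation, sketched here with dplyr's across() and cur_column(), looks up each column's limits by name. The lookup vectors lo and hi and the column name flag are assumptions made for this example, not part of the original answer.

```r
library(dplyr)

df <- data.frame(id = 1:4, x = c(2, -1, 5, 4),
                 y = c(30521, 1835, 25939, 1000000))
dfilte <- data.frame(var = c("x", "y"), min = c(1, 0), max = c(5, 99999))

# Named lookup vectors: lo[["x"]] is the minimum for x, hi[["x"]] its maximum
vars <- as.character(dfilte$var)
lo <- setNames(dfilte$min, vars)
hi <- setNames(dfilte$max, vars)

# For every listed column, test whether each value is outside its own range,
# then flag a row if at least one of its columns is out of range
flagged <- df %>%
  mutate(flag = rowSums(across(all_of(vars),
                               ~ .x < lo[[cur_column()]] | .x > hi[[cur_column()]])) > 0)

# Drop the flagged rows
clean <- flagged %>% filter(!flag)
```

Adding a variable then only requires a new row in dfilte.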

Answer 4

I think your problem is a good fit for the cut function in base R:

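# include.lowest = TRUE keeps values equal to the minimum; by default cut() uses (min, max]
df$to.remove <- is.na(cut(df$x, breaks = unlist(ranges[1, -1]), include.lowest = TRUE)) | 
                is.na(cut(df$y, breaks = unlist(ranges[2, -1]), include.lowest = TRUE))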

#  id  x       y to.remove
#1  1  2   30521     FALSE
#2  2 -1    1835      TRUE
#3  3  5   25939     FALSE
#4  4  4 1000000      TRUE

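is.na(...) gives you a logical vector that is TRUE for values outside the specified range. Finally, you apply |, the or operator, to decide which rows must be removed.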

To clean the data, you only need to do this:

df <- df[!df$to.remove,]

Edit

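I just noticed (from your comment) that your data frame contains more variables than x and y. In that case, you can define a function called f and apply it to the variables in your data frame as follows.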

f <- function(x, xrange, y, yrange) {
  is.na(cut(x, breaks = xrange)) | is.na(cut(y, breaks = yrange))
}
res <- f(df$x, ranges[1,][-1], df$y, ranges[2,][-1])
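
The function above still names x and y explicitly. A version that loops over whatever variables are listed in ranges might look like the sketch below; the helper outside and the matrix per_var are hypothetical names, and include.lowest = TRUE is an assumption that the minimum should count as in range.

```r
df <- data.frame(id = 1:4, x = c(2, -1, 5, 4),
                 y = c(30521, 1835, 25939, 1000000))
ranges <- data.frame(var = c("x", "y"), min = c(1, 0), max = c(5, 99999))

# TRUE where the value of column v lies outside [lo, hi];
# cut() returns NA for anything that does not fall into the single interval
outside <- function(v, lo, hi) {
  is.na(cut(df[[v]], breaks = c(lo, hi), include.lowest = TRUE))
}

# One logical column per variable listed in ranges
per_var <- mapply(outside, as.character(ranges$var), ranges$min, ranges$max)

# Remove a row if any of its variables is out of range
df$to.remove <- rowSums(per_var) > 0
df_clean <- df[!df$to.remove, ]
```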

Data

df <- structure(list(id = 1:4, x = c(2L, -1L, 5L, 4L),
                     y = c(30521L, 1835L, 25939L, 1000000L)),
                .Names = c("id", "x", "y"), class = "data.frame",
                row.names = c(NA, -4L))
ranges <- structure(list(var = structure(1:2, .Label = c("x", "y"), class = "factor"),
                         min = c(1L, 0L), max = c(5L, 99999L)),
                    .Names = c("var", "min", "max"), class = "data.frame",
                    row.names = c(NA, -2L))

Answer 5

I don't fully understand the output you want, but this works for any range and any amount of data:

> df

  id  x       y
1  1  2   30521
2  2 -1    1835
3  3  5   25939
4  4  4 1000000


# I transposed your filter data frame so it's easier to work with.
> dfFilter

    x     y
min 1     0
max 5 99999

Then you can apply the filters based on the ranges in dfFilter:

#Flag original dataframe with values between the minimum x and maximum x 

   df$flag_x=ifelse(df$x > min(dfFilter$x) & df$x < max(dfFilter$x), "yes","no")


#Flag original dataframe with values between the minimum y and maximum y

   df$flag_y=ifelse(df$y > min(dfFilter$y) & df$y < max(dfFilter$y), "yes","no")

So the output looks like this:

  id  x       y flag_x flag_y
1  1  2   30521    yes    yes
2  2 -1    1835     no    yes
3  3  5   25939     no    yes
4  4  4 1000000    yes     no

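Of course, you can change this filter or apply any arithmetic to it to get the output you want (for example, the minimum of x minus 2: min(dfFilter$x)-2), or use >= and <= if values equal to a limit should count as inside the range.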
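
The transposed dfFilter is not constructed in the answer, so here is one way it could be built from a var/min/max table, followed by a loop that adds one flag column per variable. This is only a sketch: the ranges table is copied from the question, and inclusive bounds (>= and <=) are an assumption.

```r
df <- data.frame(id = 1:4, x = c(2, -1, 5, 4),
                 y = c(30521, 1835, 25939, 1000000))
ranges <- data.frame(var = c("x", "y"), min = c(1, 0), max = c(5, 99999))

# Transpose the limits so that each variable becomes a column with rows min and max
dfFilter <- as.data.frame(t(ranges[, c("min", "max")]))
names(dfFilter) <- as.character(ranges$var)

# Add one flag column per variable; "yes" means the value is inside the range
for (v in names(dfFilter)) {
  lo <- dfFilter["min", v]
  hi <- dfFilter["max", v]
  df[[paste0("flag_", v)]] <- ifelse(df[[v]] >= lo & df[[v]] <= hi, "yes", "no")
}
```

Because the loop runs over names(dfFilter), further variables only need an extra row in ranges.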

Hope it works.

How do I remove records from a data frame that fall outside variable-specific ranges? [R]
