Question

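I want to subset a data frame in R based on a non-cumulative row sum plus additional conditions.

For example, I have the following data frame: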

x<-data.frame(x1=c(1,2,3,4,5,6,7,8,9),x2=c(70,1,6,23,98,21,45,8,6))

Now I want to subset x with two conditions:

The sum of x2 must be less than 60.
x1 must be greater than 2.

So I tried:

subset(x, cumsum(x2)<60 & x1>2)

Obviously, my code does not work (it returns an empty data frame), because I am using cumsum and the first element of x2 is already greater than 60.
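To see why the result is empty, it can help to print the running sum. The following is a minimal sketch using the data frame from the question. Filtering on x1 first and only then taking the cumulative sum is one possible variation, shown here as an assumption about what is wanted rather than as the intended solution; the names `running`, `filtered` and `result` are illustrative.

```r
x <- data.frame(x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                x2 = c(70, 1, 6, 23, 98, 21, 45, 8, 6))

# The running sum starts at 70, so it is never below 60
running <- cumsum(x$x2)
print(running)

# Filtering on x1 first restarts the running sum at the row with x1 = 3
filtered <- x[x$x1 > 2, ]
result <- filtered[cumsum(filtered$x2) < 60, ]
print(result)
```

This variation always returns the same leading rows in their original order, so it yields only one of the possible solutions.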

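I would expect a data frame that looks like this:

because the sum of the x2 values is less than 60 and the x1 values are greater than 2.

Since the solution is dynamic, another possible result is:

Or: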

  x1 x2
3  3  6

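Once I understand how to implement this, I will narrow the set of possible solutions by adding more conditions.

Edit (Ronak Shah)

An additional column x3 is added, so the data frame x becomes: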

x<-data.frame(x1=c(1,2,3,4,5,6,7,8,9),x2=c(70,1,6,23,98,21,45,8,6),x3=c(13,2,31,45,5,6,7,18,0))

The sum of x3 should be less than 20, so x3_thresh should be 20.

Modified solution

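subset_df_row <- function(x, x1_value, x2_thresh, x3_thresh) {
  # Keep only rows whose x1 exceeds x1_value
  df1 <- x[x$x1 > x1_value, ]
  # If no row is below both thresholds, return an empty data frame
  # (otherwise the reshuffle loop below would never terminate)
  if (!any(df1$x2 < x2_thresh & df1$x3 < x3_thresh)) return(df1[0, ])
  # Shuffle rows to get a random result
  df1 <- df1[sample(seq_len(nrow(df1))), ]
  # Reshuffle until the first row is below both thresholds
  while (df1$x2[1] >= x2_thresh || df1$x3[1] >= x3_thresh) {
    df1 <- df1[sample(seq_len(nrow(df1))), ]
  }
  # Keep the leading rows while both running sums stay below their thresholds;
  # counting leading TRUEs also covers the case where no threshold is reached
  keep <- cumsum(df1$x2) < x2_thresh & cumsum(df1$x3) < x3_thresh
  df1[seq_len(sum(cumprod(keep))), ]
}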
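Because the function returns a random subset, the output cannot be compared against a fixed table. One way to check a result is a small validity test, sketched below. The helper name `is_valid_subset` is hypothetical and not part of the original answer; the row selections are hand-picked examples.

```r
# Hypothetical helper: TRUE when a subset satisfies all three conditions
is_valid_subset <- function(df, x1_value, x2_thresh, x3_thresh) {
  all(df$x1 > x1_value) &&
    sum(df$x2) < x2_thresh &&
    sum(df$x3) < x3_thresh
}

x <- data.frame(x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                x2 = c(70, 1, 6, 23, 98, 21, 45, 8, 6),
                x3 = c(13, 2, 31, 45, 5, 6, 7, 18, 0))

# Rows 6 and 9: x2 sums to 27 and x3 sums to 6, so the subset is valid
valid <- is_valid_subset(x[c(6, 9), ], 2, 60, 20)
# Rows 3 and 4: x3 sums to 76, so the subset is rejected
invalid <- is_valid_subset(x[c(3, 4), ], 2, 60, 20)
```

With subset_df_row defined as above, is_valid_subset(subset_df_row(x, 2, 60, 20), 2, 60, 20) should be TRUE on every run.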

Answer 1

We can write a function that subsets the data frame:

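subset_df_row <- function(x, x1_value, x2_thresh) {
    # Keep only rows whose x1 exceeds x1_value
    df1 <- x[x$x1 > x1_value, ]
    # If no remaining x2 is below the threshold, return an empty data frame
    # (otherwise the reshuffle loop below would never terminate)
    if (!any(df1$x2 < x2_thresh)) return(df1[0, ])
    # Shuffle rows to get a random result
    df1 <- df1[sample(seq_len(nrow(df1))), ]
    # Reshuffle until the first x2 value is below the threshold
    while (df1$x2[1] >= x2_thresh) {
      df1 <- df1[sample(seq_len(nrow(df1))), ]
    }
    # Keep the leading rows whose running sum of x2 stays below the threshold;
    # counting leading TRUEs also covers the case where it is never reached
    df1[seq_len(sum(cumprod(cumsum(df1$x2) < x2_thresh))), ]
}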

Then pass the x1 and x2 filter values dynamically (the rows are shuffled, so the result varies between runs):

subset_df_row(x, 2, 60)
#  x1 x2
#6  6 21
#8  8  8

subset_df_row(x, 3, 160)
#  x1 x2
#8  8  8
#5  5 98
#4  4 23
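The function above returns one random solution per call. Since the question mentions a set of possible solutions, here is a rough sketch of how all of them could be listed for a small data frame. The helper name `all_valid_subsets` is hypothetical, and the approach is brute force: it tests every combination of the filtered rows, so the cost grows exponentially with the number of rows.

```r
x <- data.frame(x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                x2 = c(70, 1, 6, 23, 98, 21, 45, 8, 6))

# Hypothetical helper: every non-empty set of rows with
# x1 > x1_value and sum(x2) < x2_thresh
all_valid_subsets <- function(x, x1_value, x2_thresh) {
  df1 <- x[x$x1 > x1_value, ]
  idx <- seq_len(nrow(df1))
  out <- list()
  for (k in idx) {
    # combn() with simplify = FALSE returns a list of index vectors of size k
    for (rows in combn(idx, k, simplify = FALSE)) {
      if (sum(df1$x2[rows]) < x2_thresh) {
        out[[length(out) + 1]] <- df1[rows, ]
      }
    }
  }
  out
}

solutions <- all_valid_subsets(x, 2, 60)
length(solutions)
```

Further conditions can then be applied to `solutions` with Filter() to narrow the set.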
