Question

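I want to remove rows from a data frame when the contents of multiple columns do not match a varying set of user criteria.

The following reprex should cover the complexity of what I am trying to achieve.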

install.packages("dplyr")
install.packages("purrr")

options(stringsAsFactors=FALSE)

# Create user criteria (UC) data.
UC <- data.frame(
  Series = 1:5, 
  Unit = c("cm","mm",NA,NA,"cm"),
  Month = c(NA,NA,"Jan",NA,"Feb"), 
  Height = c(3,NA,NA,3,1)
)

# Create range of scenarios (RS) but only consider two series initially.
set.seed(2)
num_series <- 2
RS <- data.frame(
  Series = sample(c(1:5), num_series, replace=TRUE),
  Unit = sample(c("cm","mm"), num_series, replace=TRUE),
  Month = sample(c("Jan","Feb","Mar","Apr"), num_series, replace=TRUE), 
  Height = sample(c(1:3), num_series, replace=TRUE) 
)

# Identify applicable criteria for matching (AC).
AC <- dplyr::filter(UC, UC$Series %in% unique(RS$Series)) 
AC <- AC[, !purrr::map_lgl(AC, ~all(is.na(.))), drop=FALSE]

# Combine the scenario data and the applicable criteria.
SC <- merge(x=RS, y=AC, by="Series", all.x=TRUE) 

# Function to identify rows for removal.
fn_remove_row <- function(cols, rm) {
  x <- paste0(cols,".x")
  y <- paste0(cols,".y")
  rm$remove <- ifelse(rm$remove == 0 & !is.na(rm[[y]]) & rm[[y]] != rm[[x]], 1, rm$remove)
  rm[[y]] <- NULL
  data.table::setnames(rm, eval(substitute(x)), unlist(cols)) # setnames() comes from data.table
}

# Identify columns to be considered for matching for the given scenarios.
cols <- as.list(gsub("\\.y","",grep("\\.y", names(SC), value=TRUE)))

# Final dataframe (with option to filter).
SC$remove <- 0 # Initial values.
df <- cbind(data.frame(lapply(cols, fn_remove_row, rm=SC)))
#df <- dplyr::filter(df, remove == 0)

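Some explanation of my code:

• The UC data frame provides some example rules that may or may not apply to particular columns, depending on the series observed in the data.

• The RS data frame creates the different data scenarios that need to be satisfied, although I only consider two scenarios initially. The num_series parameter can be increased to provide more scenarios.

• The AC data frame subsets the user criteria to select only the columns that apply to the observed scenarios.

• The SC data frame combines the scenario data and the applicable criteria. Columns that have rules to apply carry the suffix .x (original data) and the suffix .y (criteria).

• I created a function to consider the required columns in turn and check whether the values match. If they do not match, the row is flagged with "1" to indicate that it should be removed. If the criterion value for a particular column is missing (NA), no matching is needed in that case. After the check, the criteria column is dropped and the original data column is renamed to remove the suffix.

• I use lapply to create a final data frame (df) containing the column to filter on. The filter is currently not applied because the flag is not being created correctly.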

The input data frames (created with seed = 2) are:

> UC                                  > RS
  Series Unit Month   Height            Series Unit Month Height
     1    cm   <NA>      3                 1    mm   Apr     1
     2    mm   <NA>     NA                 4    cm   Apr     3
     3   <NA>   Jan     NA
     4   <NA>  <NA>      3
     5    cm    Feb      1

Because RS contains Series 1 and 4, AC is created to contain these series as well, and only the applicable columns are retained:

> AC
  Series Unit Height
     1    cm     3
     4   <NA>    3

The merge combines RS and AC as required, and the remove flag is initialised:

> SC
  Series Unit.x Month Height.x Unit.y Height.y remove
     1     mm    Apr      1     cm       3       0
     4     cm    Apr      3    <NA>      3       0

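In this case I want to flag Series 1 for removal because Unit.x does not equal Unit.y, although even if they did match it would still be flagged because Height.x does not equal Height.y. The Month column does not enter the equation because there are no applicable criteria for these two series.

Series 4 will not be flagged because the unit comparison does not apply (Unit.y is NA) and the height comparison gives a match.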
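To make the intended rule concrete, here is a minimal base R sketch applied to the two rows of SC printed above. It assumes SC is rebuilt by hand from that printout, and `fails` is a hypothetical helper name that does not appear in the original code.

```r
# SC rebuilt by hand from the printed output above (seed = 2); an assumption
# made so that the sketch runs on its own.
SC <- data.frame(
  Series   = c(1, 4),
  Unit.x   = c("mm", "cm"),
  Month    = c("Apr", "Apr"),
  Height.x = c(1, 3),
  Unit.y   = c("cm", NA),
  Height.y = c(3, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a criterion only applies when it is not NA, and a row
# fails it when the criterion applies and the observed value differs.
fails <- function(observed, criterion) {
  !is.na(criterion) & criterion != observed
}

unit_fail   <- fails(SC$Unit.x, SC$Unit.y)     # TRUE FALSE
height_fail <- fails(SC$Height.x, SC$Height.y) # TRUE FALSE

# A row is flagged as soon as any applicable criterion fails.
SC$remove <- as.integer(unit_fail | height_fail)
SC$remove # 1 0
```

Note that `FALSE & NA` evaluates to `FALSE` in R, which is why the NA criterion for Series 4 does not propagate into the flag.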

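Finally (before filtering):

> df
  Series Unit Month Height remove
     1    mm   Apr     1      1
     4    cm   Apr     3      0

Despite the absence of a return() call and various cbind attempts not shown here, what I get from lapply is duplicated columns:

> df
  Series Unit Month Height.x Height.y remove Series.1 Unit.x Month.1 Height Unit.y remove.1
     1    mm   Apr     1        3       1       1       mm     Apr   ...

Is lapply the wrong function for looping through the applicable columns? It seems to me that only one small key element is missing.

A full test of a solution should use different seeds and increase num_series.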

Answer 1

This seems like a good occasion to use a for loop instead of our trusty friend lapply:

# Function to identify rows for removal.
fn_remove_row <- function(col, rm) {
  x <- paste0(col,".x")
  y <- paste0(col,".y")
  rm$remove <- ifelse(rm$remove == 0 & !is.na(rm[[y]]) & rm[[y]] != rm[[x]], 1, rm$remove)
  rm[[y]] <- NULL
  data.table::setnames(rm, x, col) # setnames() comes from data.table
  return(rm)
}

# Identify columns to be considered for matching for the given scenarios.
cols <- c(gsub("\\.y","",grep("\\.y", names(SC), value=TRUE)))

# Final dataframe (with option to filter).
SC$remove <- 0 # Initial values.
for (i in seq_along(cols)) {
  col <- cols[i]
  SC <- fn_remove_row(col, SC)
}
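The loop works because each iteration hands its result to the next one. lapply, by contrast, calls the function once per column on the same unmodified SC and returns a list of independent data frames, so binding them side by side repeats every column. If a functional style is still preferred, the same threading of state can be written with base R's Reduce(). The following is only a sketch: SC is rebuilt by hand from the printout in the question, `flag_one` is a hypothetical name, and the renaming uses base `names<-` instead of setnames() so that no extra package is needed.

```r
# SC rebuilt by hand from the question's printout (an assumption for this sketch).
SC <- data.frame(
  Series   = c(1, 4),
  Unit.x   = c("mm", "cm"),
  Month    = c("Apr", "Apr"),
  Height.x = c(1, 3),
  Unit.y   = c("cm", NA),
  Height.y = c(3, 3),
  stringsAsFactors = FALSE
)
SC$remove <- 0 # Initial values.

# Hypothetical variant of fn_remove_row with the data frame as first argument,
# so that Reduce() can pass the accumulated result along.
flag_one <- function(rm, col) {
  x <- paste0(col, ".x")
  y <- paste0(col, ".y")
  rm$remove <- ifelse(rm$remove == 0 & !is.na(rm[[y]]) & rm[[y]] != rm[[x]], 1, rm$remove)
  rm[[y]] <- NULL                  # drop the criteria column
  names(rm)[names(rm) == x] <- col # strip the .x suffix
  rm
}

cols <- sub("\\.y$", "", grep("\\.y$", names(SC), value = TRUE))

# Reduce() calls flag_one(SC, cols[1]), then flag_one(<that result>, cols[2]), ...
df <- Reduce(flag_one, cols, SC)
df
```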

Answer 2

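Here is a somewhat different, vectorized approach to the problem. I have tried to read through and understand the question, and I hope I have understood exactly what you want to do.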

x <- paste0(cols,".x")
y <- paste0(cols,".y")
SC$remove <- as.integer(rowSums(!is.na(SC[y]) & SC[x] != SC[y]) > 0)
SC[y] <- NULL
names(SC)[names(SC) %in% x] <- cols

SC
#  Series Unit Month Height remove
#1      1   mm   Apr      1      1
#2      4   cm   Apr      3      0

You can extend this to many columns and wrap it in a function if needed.
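As a rough illustration of what such a wrapper might look like, the sketch below packages the vectorized steps in a function and runs it over several seeds with a larger num_series, as the question suggests. The names `flag_rows` and `make_SC` are hypothetical, the data preparation is redone in base R rather than with dplyr and purrr, and the early return for the case with no criteria columns is an added assumption.

```r
# Criteria table from the question.
UC <- data.frame(
  Series = 1:5,
  Unit   = c("cm", "mm", NA, NA, "cm"),
  Month  = c(NA, NA, "Jan", NA, "Feb"),
  Height = c(3, NA, NA, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper around the vectorized approach.
flag_rows <- function(SC) {
  cols <- sub("\\.y$", "", grep("\\.y$", names(SC), value = TRUE))
  SC$remove <- 0L
  if (length(cols) == 0L) return(SC) # assumption: no criteria means nothing to flag
  x <- paste0(cols, ".x")
  y <- paste0(cols, ".y")
  SC$remove <- as.integer(rowSums(!is.na(SC[y]) & SC[x] != SC[y]) > 0)
  SC[y] <- NULL
  names(SC)[match(x, names(SC))] <- cols
  SC
}

# Hypothetical helper that rebuilds SC for a given seed, using base R only.
make_SC <- function(seed, num_series) {
  set.seed(seed)
  RS <- data.frame(
    Series = sample(1:5, num_series, replace = TRUE),
    Unit   = sample(c("cm", "mm"), num_series, replace = TRUE),
    Month  = sample(c("Jan", "Feb", "Mar", "Apr"), num_series, replace = TRUE),
    Height = sample(1:3, num_series, replace = TRUE),
    stringsAsFactors = FALSE
  )
  AC <- UC[UC$Series %in% unique(RS$Series), , drop = FALSE]
  AC <- AC[, !vapply(AC, function(v) all(is.na(v)), logical(1)), drop = FALSE]
  merge(x = RS, y = AC, by = "Series", all.x = TRUE)
}

# Try a few seeds with more scenarios and tabulate the flags.
for (seed in 1:5) {
  res <- flag_rows(make_SC(seed, num_series = 10))
  print(table(res$remove))
}
```

Deriving `cols` inside the function keeps the call site down to a single argument; if the criteria columns should be chosen explicitly, `cols` could be made a parameter instead.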

Identify rows for removal based on rules to be applied to multiple columns
