Question

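I have a data frame with about 20 columns and about 10^7 rows. One of the columns is an id column that is a factor. I want to filter the rows by properties of the string representation of the factor levels. The code below achieves this, but it seems rather inelegant to me. In particular, I feel I should not have to create a separate vector of the relevant ids.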

Any suggestions for simplifying this?

library(dplyr)
library(tidyr)
library(gdata)

dat <- data.frame(id=factor(c("xxx-nld", "xxx-jap", "yyy-aus", "zzz-ita")))

europ.id <- function(id) {
  ctry.code <- substring(id, nchar(id)-2)
  ctry.code %in% c("nld", "ita")
}

ids <- levels(dat$id)
europ.ids <- subset(ids, europ.id(ids))

datx <- dat %>% filter(id %in% europ.ids) %>% drop.levels

Answer 1

Docendo Discimus gave the correct answer in the comments. First, to explain the error I kept running into in my various attempts:

> dat %>% filter(europ.id(id))
Error in nchar(id) : 'nchar()' requires a character vector
Calls: %>% ... filter_impl -> .Call -> europ.id -> substring -> nchar
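
To see the cause in isolation, here is a minimal base R sketch. The variable names (f, res, n, g, m) are made up for illustration. It assumes only that id is a factor, as in the question, and shows that nchar() rejects a factor while grepl() and %in% accept one.

```r
# A small factor standing in for the id column (illustrative values).
f <- factor(c("xxx-nld", "yyy-aus"))

# nchar() refuses a factor; capture the error message instead of stopping.
res <- tryCatch(nchar(f), error = function(e) conditionMessage(e))
print(res)

# Converting to character first makes nchar() work.
n <- nchar(as.character(f))
print(n)

# grepl() and %in% coerce the factor to character themselves.
g <- grepl("nld$", f)
m <- f %in% c("xxx-nld")
print(g)
print(m)
```

So the failure comes from nchar() inside europ.id, not from filter itself.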

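Then note that his solution works because grepl applies as.character to its argument if needed (from the man page: a character vector where matches are sought, or an object which can be coerced by as.character to a character vector). The same implicit as.character conversion happens when you use %in%. Since this solution is also perfectly efficient, we can do the following: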

dat %>% filter(europ.id(as.character(id))) %>% droplevels

Or, to make it read better, update the function:

europ.id <- function(id) {
  ids <- as.character(id)
  ctry.code <- substring(ids, nchar(ids)-2)
  ctry.code %in% c("nld", "ita")
}

and use

dat %>% filter(europ.id(id)) %>% droplevels

which reads exactly like what I was looking for.
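
For comparison, the grepl-based approach mentioned above might look roughly like the sketch below. The comment's exact code is not reproduced in this thread, so the regular expression "-(nld|ita)$" is my own assumption about how the country-code test could be written.

```r
library(dplyr)

dat <- data.frame(id = factor(c("xxx-nld", "xxx-jap", "yyy-aus", "zzz-ita")))

# Keep rows whose id ends in "-nld" or "-ita".
# grepl() coerces the factor to character itself, so no as.character() is needed.
# The pattern is an assumed stand-in for the europ.id() country-code check.
datx <- dat %>%
  filter(grepl("-(nld|ita)$", id)) %>%
  droplevels()

print(datx)
```

This avoids the intermediate vector of ids as well, at the cost of putting the country list into a regular expression rather than a character vector.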
