Question

I have data that looks like this:

data <- data.frame(
  ID_num = c("BGR9876", "BNG3421", "GTH4567", "YOP9824", "Child 1", "2JAZZ", "TYH7654"),
  date_created = "19/07/1983"
)

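I want to filter the data frame so that only the rows where ID_num follows the pattern ABC1234 are kept. I am new to using regular expressions with grep, and I am getting something wrong. This is what I am trying: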

data_clean <- data %>%
  filter(grep("[A-Z]{3}[1:9]{4}", ID_num))

which gives me the error: Error in filter_impl(.data, quo) : Argument 2 filter condition does not evaluate to a logical vector

This is my desired output:

data_clean <- data.frame(
  ID_num = c("BGR9876", "BNG3421", "GTH4567", "YOP9824", "TYH7654"),
  date_created = "19/07/1983"
)

Thanks

Answer 1

The 1:9 should be 1-9, and the pattern should be used with grepl, with ^ marking the start of the string and $ marking the end of the string.

library(dplyr)
data %>%
   filter(grepl("^[A-Z]{3}[1-9]{4}$", ID_num))
#   ID_num date_created
#1 BGR9876   19/07/1983
#2 BNG3421   19/07/1983
#3 GTH4567   19/07/1983
#4 YOP9824   19/07/1983
#5 TYH7654   19/07/1983

filter expects a logical vector. grep returns numeric indices, whereas grepl returns a logical vector.
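To make the difference concrete, here is a minimal sketch that runs both functions on the ID values from the question. The variable names ids, pattern, idx and lgl are introduced here for illustration only.

```r
# ID values copied from the question's data frame
ids <- c("BGR9876", "BNG3421", "GTH4567", "YOP9824", "Child 1", "2JAZZ", "TYH7654")
pattern <- "^[A-Z]{3}[1-9]{4}$"

idx <- grep(pattern, ids)    # integer positions of the matching elements
lgl <- grepl(pattern, ids)   # one TRUE/FALSE per element, same length as ids

idx
# [1] 1 2 3 4 7
lgl
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
```

Because lgl has one entry per row, it is the form that filter can use directly; idx is a shorter vector of row positions.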

Alternatively, if we want to use grep, use slice, which takes numeric indices:

data %>%
   slice(grep("^[A-Z]{3}[1-9]{4}$", ID_num))

A similar option in the tidyverse is str_detect:

library(stringr)
data %>%
    filter(str_detect(ID_num, "^[A-Z]{3}[1-9]{4}$"))

In base R, we can do:

subset(data, grepl("^[A-Z]{3}[1-9]{4}$", ID_num))

Or with Extract (the [ operator):

data[grepl("^[A-Z]{3}[1-9]{4}$", data$ID_num),]

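Note that the anchored pattern matches only strings that consist of exactly 3 uppercase letters followed by 4 digits. Without ^ and $, a longer string that merely contains the pattern also matches: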

grepl("[A-Z]{3}[1-9]{4}", "ABGR9876923")
#[1] TRUE

grepl("^[A-Z]{3}[1-9]{4}$", "ABGR9876923")
#[1] FALSE

Answer 2

We can use grepl with the pattern:

data[grepl("[A-Z]{3}\\d{4}", data$ID_num), ]

#   ID_num date_created
#1 BGR9876   19/07/1983
#2 BNG3421   19/07/1983
#3 GTH4567   19/07/1983
#4 YOP9824   19/07/1983
#7 TYH7654   19/07/1983

Or inside filter:

library(dplyr)
data %>% filter(grepl("[A-Z]{3}\\d{4}", ID_num))
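The patterns in the two answers differ in two ways: Answer 1 anchors the match with ^ and $ and uses [1-9], which excludes the digit 0, while Answer 2 is unanchored and uses \\d, which accepts any digit. The sketch below compares them on one ID from the question and two made-up IDs ("ABC1034" and "ABGR9876923" are test values, not part of the original data). The [0-9] variant is an assumption about what the ABC1234 format is meant to allow.

```r
# First value is from the question; the other two are hypothetical test IDs
ids <- c("BGR9876", "ABC1034", "ABGR9876923")

# Anchored, digits 1-9 only (Answer 1): rejects the ID containing a 0
grepl("^[A-Z]{3}[1-9]{4}$", ids)
# [1]  TRUE FALSE FALSE

# Anchored, digits 0-9: accepts zeros, still rejects the over-long string
grepl("^[A-Z]{3}[0-9]{4}$", ids)
# [1]  TRUE  TRUE FALSE

# Unanchored with \\d (Answer 2): matches anywhere inside the string
grepl("[A-Z]{3}\\d{4}", ids)
# [1] TRUE TRUE TRUE
```

Which variant is appropriate depends on whether zeros are valid in the ID and whether longer strings containing the pattern should be kept.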

Filtering values in a data frame variable that match a regular expression using grep in R
