Question

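I am currently using nested ifelse calls with grepl to check a column of strings in a data frame against a vector of possible matches, for example: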

# vector of possible words to match
x <- c("Action", "Adventure", "Animation")

# data
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
my_text <- as.data.frame(my_text)

my_text$new_column <- ifelse (
  grepl("Action", my_text$my_text) == TRUE,
  "Action",
  ifelse (
    grepl("Adventure", my_text$my_text) == TRUE,
    "Adventure",
    ifelse (
      grepl("Animation", my_text$my_text) == TRUE,
      "Animation", NA)))

> my_text$new_column
[1] "Animation" NA          "Adventure"

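This works fine for a few elements (e.g. the three here), but how should I do this when the number of possible matches is much larger (e.g. 150)? Nested ifelse seems crazy. I know I can check for several patterns at once, as in the code below, but that only returns a logical telling me whether a string matched, not which string matched. I would like to know what matched (when there are multiple matches, any one of them is fine).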

x <- c("Action", "Adventure", "Animation")
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
grepl(paste(x, collapse = "|"), my_text)

returns: [1]  TRUE FALSE  TRUE
what I'd like it to return: "Animation" ""(or FALSE) "Adventure"

Answer 1

Following the pattern here, a base R solution.

x <- c("ActionABC", "AdventureDEF", "AnimationGHI")

regmatches(x, regexpr("(Action|Adventure|Animation)", x))

stringr has a simpler way to do this:

library(stringr)
str_extract(x, "(Action|Adventure|Animation)")
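As a rough sketch of how the base R pattern above might be applied to the data from the question: `regmatches()` drops elements that have no match, so one option is to fill a pre-sized vector of `NA` values. The names `pattern`, `m`, and `matched` are illustrative choices, not part of the original answer.

```r
# Data from the question
x <- c("Action", "Adventure", "Animation")
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")

# Build one alternation pattern from the vector of candidate words
pattern <- paste(x, collapse = "|")

# regexpr() returns -1 for elements that contain no match
m <- regexpr(pattern, my_text)

# regmatches() only returns the matched pieces, so assign them
# into the positions that actually matched and leave the rest NA
matched <- rep(NA_character_, length(my_text))
matched[m != -1] <- regmatches(my_text, m)
matched
```

With this approach the result keeps the same length as the input, so it can be assigned to a data frame column directly.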

Answer 2

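Building on Benjamin's base solution, use lapply so that you get a character(0) value when there is no match.

If you use regmatches directly on your example code, you get the following error.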

    my_text$new_column <-regmatches(x = my_text$my_text, m = regexpr(pattern = paste(x, collapse = "|"), text = my_text$my_text))

    Error in `$<-.data.frame`(`*tmp*`, new_column, value = c("Animation",  : 
  replacement has 2 rows, data has 3

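This is because there are only 2 matches, and it tries to fit the matched values into a data frame column that has 3 rows.

To fill the non-matches with a special value so that the assignment can be done directly, we can use lapply.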

my_text$new_column <-
lapply(X = my_text$my_text, FUN = function(X){
  regmatches(x = X, m = regexpr(pattern = paste(x, collapse = "|"), text = X))
})

This puts character(0) wherever there is no match.
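If a plain character column is preferred over a list column, one possible follow-up (an assumption about what is wanted, not part of the original answer) is to map each character(0) entry to `NA`. The names `matches` and `flat` below are illustrative.

```r
# Data from the question
x <- c("Action", "Adventure", "Animation")
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
pattern <- paste(x, collapse = "|")

# Same idea as the lapply() call above: one list element per input string,
# with character(0) where nothing matched
matches <- lapply(my_text, function(txt) {
  regmatches(txt, regexpr(pattern, txt))
})

# Flatten the list to a character vector, using NA for the empty entries
flat <- vapply(
  matches,
  function(m) if (length(m) == 0) NA_character_ else m,
  character(1)
)
flat
```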


Hope this helps.

Answer 3

This will do it...

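my_text$new_column <- unlist(              
                         apply(            
                             sapply(x, grepl, my_text$my_text),
                             1,
                             # collapse = "" joins several matches into one string per row,
                             # so the result always has one element per row
                             function(y) paste("", x[y], collapse = "")))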

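The sapply produces a logical matrix showing which of the terms in x appear in each element of your column. The apply then runs through this row by row and pastes together all of the values of x corresponding to TRUE values. (It pastes "" at the start to avoid NAs and to keep the length of the output the same as the original data.) If two terms in x are matched for a row, they will be pasted together in the output.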
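To make the intermediate logical matrix concrete, here is a small sketch on the question's data. Picking only the first matching term per row is a variation suggested by the question ("any one of them is fine"), not what the answer's code does; `hits` and `first_hit` are illustrative names.

```r
# Data from the question
x <- c("Action", "Adventure", "Animation")
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")

# One row per text, one column per candidate term.
# Note: with a single input string sapply() would return a vector, not a matrix.
hits <- sapply(x, grepl, my_text)
hits

# For each row, keep the first term that matched, or NA if none did
first_hit <- apply(hits, 1, function(row) {
  if (any(row)) x[which(row)[1]] else NA_character_
})
first_hit
```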

Return the matched string from a grepl match against multiple strings, rather than a logical
