使用grepl创建基于另一列的列

时间:2020-06-08 14:29:57

标签: r string dataframe grepl startswith

让我们考虑一个df,其中有两列wordstem。我想创建一个新列,以检查stem中是否包含word中的值,以及该值是在其他字符之前还是之后。最终结果应如下所示:

WORD     STEM     NEW
rerun    run      prefixed
runner   run      suffixed
run      run      none
...      ...      ...

在下面,您可以看到到目前为止的代码。但是,由于grepl表达式应用于df的所有行,因此它不起作用。无论如何,我认为这应该使我的想法更清楚。

df$new <- ifelse(grepl(paste0('.+', df$stem, '.+'), df$word), 'both',
             ifelse(grepl(paste0(df$stem, '.+'), df$word), 'suffixed',
                ifelse(grepl(paste0('.+', df$stem), df$word), 'prefixed','none')))

3 个答案:

答案 0 :(得分:2)

您可以使用mapply每行使用grepl,例如:

ifelse(mapply(grepl, paste0(".+", x$STEM, ".+"), x$WORD), "both",
ifelse(mapply(grepl, paste0(x$STEM, ".+"), x$WORD), "suffixed",
ifelse(mapply(grepl, paste0(".+", x$STEM), x$WORD), "prefixed", "none")))
#"prefixed" "suffixed"     "none" 

或者使用startsWithendsWith并使用子集形式向量:

c("none", "both", "prefixed", "suffixed")[1 + (1 + startsWith(x$WORD, x$STEM) +
 2*endsWith(x$WORD, x$STEM)) * (nchar(x$WORD) > nchar(x$STEM) &
 mapply(grepl, x$STEM, x$WORD))]
#[1] "suffixed" "prefixed" "none"    

答案 1 :(得分:1)

您可以像这样创建new

df$new <- ifelse(startsWith(df$word, df$stem) & endsWith(df$word, df$stem), 'none',
                 ifelse(startsWith(df$word, df$stem), 'suffixed',
                        ifelse(endsWith(df$word, df$stem), 'prefixed',
                               'both')))

或者,在您处于dplyr管道中的情况下,您要避免所有烦人的df$

df %>%
  mutate(new = ifelse(startsWith(df$word, df$stem) & endsWith(df$word, df$stem), 'none',
                      ifelse(startsWith(df$word, df$stem), 'suffixed',
                             ifelse(endsWith(df$word, df$stem), 'prefixed',
                                    'both'))))

输出

#       word stem     new1
# 1    rerun  run prefixed
# 2   runner  run suffixed
# 3      run  run     none
# 4    aruna  run     both

答案 2 :(得分:1)

这里是str_locatestringr中的dplyr的一种方法:

library(dplyr)
library(stringr)
data %>%
  mutate_at(vars(WORD,STEM), as.character) %>%
  mutate(NEW = 
         case_when(str_locate(WORD,STEM)[,"start"] > 1 &
                   str_locate(WORD,STEM)[,"end"] < nchar(WORD) ~ "both",
                   str_locate(WORD,STEM)[,"start"] > 1 ~ "prefixed",
                   str_locate(WORD,STEM)[,"end"] < nchar(WORD) ~ "suffixed",
                   TRUE ~ "none"))
    WORD STEM      NEW
1  rerun  run prefixed
2 runner  run suffixed
3    run  run     none

我添加了一行以将WORDSTEM转换为字符,以防万一。