I have a list of keywords:
keywords=c("Minister", "President","Secretary")
I also have a column with different text in each row:
column=c("he is general Secretary of Ozon group", "He is vice president of our college",
"He is health minister", "He is education minister")
Is there a way to extract the variants of each keyword that appear in the column?
The output I am looking for is:
output=c("general Secretary","vice president", "education minister", "health minister")
Answer 0 (score: 0)
If you are trying to extract each keyword together with the single word immediately before it, you can do this:
pat <- paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")")
regmatches(column, gregexpr(pat, column, ignore.case = TRUE))
#[[1]]
#[1] "general Secretary"
#
#[[2]]
#[1] "vice president"
#
#[[3]]
#[1] "health minister"
#
#[[4]]
#[1] "education minister"
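Note that regmatches() returns a list with one element per row, while the output in the question is a plain character vector. A minimal sketch of flattening it, assuming each row contains at most a few matches and that the row order should be kept (the name `variants` is just illustrative):

```r
keywords <- c("Minister", "President", "Secretary")
column <- c("he is general Secretary of Ozon group",
            "He is vice president of our college",
            "He is health minister",
            "He is education minister")

# One word, one whitespace character, then any of the keywords
pat <- paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")")

# gregexpr() finds every match per row; regmatches() pulls out the matched text
matches <- regmatches(column, gregexpr(pat, column, ignore.case = TRUE))

# Flatten the per-row list into a single character vector
variants <- unlist(matches)
print(variants)
```

Rows without any match contribute nothing to the flattened vector, so `variants` can be shorter than `column`.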
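Or, using stringr (because both the pattern and the text are lowercased here, the matches come back in lowercase):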
library(stringr)
pat <- paste0("\\w+\\s(", paste(tolower(keywords), collapse = "|"), ")")
str_extract_all(tolower(column), pat)
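If the original capitalization matters (for example "general Secretary"), one possible variation is to leave the text untouched and make the pattern itself case-insensitive with stringr's regex() modifier. This is a sketch rather than part of the original answer; it assumes stringr is installed, and the name `variants` is illustrative:

```r
library(stringr)

keywords <- c("Minister", "President", "Secretary")
column <- c("he is general Secretary of Ozon group",
            "He is vice president of our college",
            "He is health minister",
            "He is education minister")

# One word, one whitespace character, then any of the keywords
pat <- paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")")

# regex(ignore_case = TRUE) matches case-insensitively without altering the text
variants <- unlist(str_extract_all(column, regex(pat, ignore_case = TRUE)))
print(variants)
```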