Extracting variants of a string from a column in R

Asked: 2018-06-18 11:00:33

Tags: r nlp text-mining

I have a list of keywords:

keywords=c("Minister", "President","Secretary")

I also have a column that holds a different piece of text in each row:

column=c("he is general Secretary of Ozon group", "He is vice president of our college",
"He is health minister", "He is education minister")

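Is there a way to extract the variants of each keyword that appear in the column, that is, the keyword together with the word that qualifies it?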

The output I am looking for is:

output=c("general Secretary","vice president", "health minister", "education minister")

1 Answer:

Answer 0: (score: 0)

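If you are trying to extract each keyword plus the single word that precedes it, you can build an alternation pattern from the keywords and match it case-insensitively: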

pat <- paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")")
regmatches(column, gregexpr(pat, column, ignore.case = TRUE))
#[[1]]
#[1] "general Secretary"
#
#[[2]]
#[1] "vice president"
#
#[[3]]
#[1] "health minister"
#
#[[4]]
#[1] "education minister"
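
If a plain character vector like the `output` in the question is wanted rather than a list, the result can be flattened with `unlist()`. The following is only a sketch built on the same idea; the `\\b` word boundaries and `perl = TRUE` are my additions (not part of the answer above) so that a keyword is not matched inside a longer word.

```r
keywords <- c("Minister", "President", "Secretary")
column <- c("he is general Secretary of Ozon group",
            "He is vice president of our college",
            "He is health minister",
            "He is education minister")

# One preceding word, whitespace, then any keyword; \\b is an added assumption
# that stops partial-word matches such as a keyword embedded in a longer token.
pat <- paste0("\\b\\w+\\s+(", paste(keywords, collapse = "|"), ")\\b")

# gregexpr() finds every match per row; regmatches() returns them as a list.
matches <- regmatches(column, gregexpr(pat, column, ignore.case = TRUE, perl = TRUE))

# unlist() drops rows without a match and concatenates the rest into one vector.
output <- unlist(matches)
output
```

Note that `unlist()` loses the link between a match and its row: rows with no match vanish and rows with several matches contribute several elements, so keep the list form if that correspondence matters.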

Or, using stringr:

library(stringr)
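pat <- paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")")
# regex(ignore_case = TRUE) matches regardless of case without lowercasing the
# text, so the extracted strings keep their original capitalisation
str_extract_all(column, regex(pat, ignore_case = TRUE))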
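
When exactly one result per row is needed, for example to store it as a new column next to the text, `str_extract()` may be a better fit than `str_extract_all()`: it returns the first match for each element and `NA` where nothing matches. This is a sketch under that assumption; the vector `rows` and its last element ("He is a teacher") are hypothetical, added only to show the no-match case.

```r
library(stringr)

keywords <- c("Minister", "President", "Secretary")
# Hypothetical data: the question's rows plus one row without any keyword.
rows <- c("he is general Secretary of Ozon group",
          "He is vice president of our college",
          "He is health minister",
          "He is education minister",
          "He is a teacher")

# Same pattern as above, wrapped in regex() for case-insensitive matching.
pat <- regex(paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")"),
             ignore_case = TRUE)

# str_extract() returns a character vector the same length as its input.
result <- str_extract(rows, pat)
result
```

Because `result` has the same length as `rows`, it can be assigned directly as a data frame column.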