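I am trying to identify, match, and extract two-word phrases from a character column of strings in an R data frame.
I have a list of example phrases, such as: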
phrases <- as.list(c("Business","Business Process", "Processes", "Business Processes"))
and a string:
string <- "brings seamless integration among the business processes and financials."
I am using str_extract_all like this:
sapply(str_extract_all(tolower(string), paste(tolower(phrases), collapse = "|")), function(s) paste(s, collapse=', '))
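This only identifies the single-word terms and misses the two-word phrase "business processes" that I also want.
The current output is: [1] "business, processes"
But I would like to get "business, processes, business processes"
I tried using \\b in the pattern and adding \\s between the words of the two-word phrases, but it did not help.
How can I extract both the single words and the two-word phrases?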
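Edit: I need to keep the matches as a column in the data frame. I tried the three suggestions below and got the following error:
Error in `$<-.data.frame`(`*tmp*`, Phrases, value = c("business", "processes", :
  replacement has 267 rows, data has 495
My data frame has several columns, one of which contains the strings to be matched against the list of phrases. I need to pull out all matches as comma-separated values on the same row as the string. Desired output: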
Row, String, Phrases
1, Businesses are great, business
2, Great thing are great,
3, Processes are great, processes
4, Business Processes are great for business, business processes, processes, business
Answer 0 (score: 0)
This seems to work:
tmp <- sapply(phrases,function(x) regmatches(string,gregexpr(paste0("\\b",x,"\\b"),string,ignore.case = T)))
> unlist(tmp)
[1] "business" "processes" "business processes"
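A likely reason this works where the single combined pattern did not: a regex search returns non-overlapping matches, so once "business" has been consumed, "business processes" can no longer match at the same position. Searching for each phrase in its own pass avoids that. Here is a minimal base-R sketch of the difference. The variable names are illustrative, and perl = TRUE is used only so that alternation is tried left to right, which is my understanding of how stringr's engine behaves as well.

```r
string  <- "brings seamless integration among the business processes and financials."
phrases <- c("Business", "Business Process", "Processes", "Business Processes")

# One combined pattern: matches cannot overlap, so the shorter
# alternative "business" wins and the two-word phrase is never reported.
combined <- paste(tolower(phrases), collapse = "|")
one_pass <- regmatches(string, gregexpr(combined, string, perl = TRUE))[[1]]

# One pattern per phrase: each phrase is searched independently,
# so overlapping phrases are all found.
per_phrase <- unlist(lapply(tolower(phrases), function(p) {
  regmatches(string, gregexpr(paste0("\\b", p, "\\b"), string, perl = TRUE))[[1]]
}))

print(one_pass)
print(per_phrase)
```

Note that "business process" is not reported by the per-phrase search, because the trailing \\b does not match in the middle of "processes".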
Answer 1 (score: 0)
unname(mapply(function(x,y)str_extract(x,paste0(tolower(y),"\\b")),string,phrases))
[1] "business" NA "processes" "business processes"
Answer 2 (score: 0)
Using grepl:
unlist(phrases[sapply(phrases, function(x) grepl(paste0("\\<", x, "\\>"), string, ignore.case = T))])
#[1] "Business" "Processes" "Business Processes"
Or all lowercase:
unlist(tolower(phrases)[sapply(tolower(phrases), function(x) grepl(paste0("\\<", x, "\\>"), tolower(string)))])
#[1] "business" "processes" "business processes"
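For the data frame case in the edited question, the "replacement has 267 rows, data has 495" error probably comes from unlist() returning one element per match rather than one per row, so rows without a match disappear. One way around it is to build exactly one comma-separated string per row. The sketch below assumes a data frame df with a character column String; those names and the helper match_phrases are hypothetical, not taken from the asker's data. It uses whole-word matching as in the answers above, so "Businesses" does not count as "business".

```r
# Hypothetical data frame mirroring the desired output in the question.
df <- data.frame(
  String = c("Businesses are great",
             "Great thing are great",
             "Processes are great",
             "Business Processes are great for business"),
  stringsAsFactors = FALSE
)
phrases <- c("Business", "Business Process", "Processes", "Business Processes")

# For one string, return every phrase that occurs as whole words,
# lower-cased and joined with ", " ("" when nothing matches).
match_phrases <- function(s, phrases) {
  hit <- vapply(phrases, function(p) {
    grepl(paste0("\\b", p, "\\b"), s, ignore.case = TRUE, perl = TRUE)
  }, logical(1))
  paste(tolower(phrases[hit]), collapse = ", ")
}

# vapply returns exactly one value per row, so the assignment
# always has nrow(df) elements.
df$Phrases <- vapply(df$String, match_phrases, character(1),
                     phrases = phrases, USE.NAMES = FALSE)
print(df)
```

If prefix matches such as "Businesses" should also count, dropping the trailing \\b from the pattern is one option, at the cost of "business process" then matching inside "business processes" too.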