If I have a data frame with the following column:
df$text <- c("This string is not that long", "This string is a bit longer but still not that long", "This one just helps with the example")
and a vector of strings like this:
keywords <- c("not that long", "This string", "example", "helps")
I am trying to add a column to my data frame that contains the list of keywords present in each row's text:
df$keywords:
1 c("This string","not that long")
2 c("This string","not that long")
3 c("helps","example")
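However, I am not sure how to 1) extract the matching words from the text column, and 2) list the matched words in each row of the new column.
Answer 0: (score: 3)
Maybe something like this: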
df = data.frame(text=c("This string is not that long", "This string is a bit longer but still not that long", "This one just helps with the example"))
keywords <- c("not that long", "This string", "example", "helps")
df$keywords = lapply(df$text, function(x) {keywords[sapply(keywords,grepl,x)]})
Output:
text keywords
1 This string is not that long not that long, This string
2 This string is a bit longer but still not that long not that long, This string
3 This one just helps with the example example, helps
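The outer lapply loops over df$text, and the inner sapply checks, for each element of keywords, whether it occurs in the current df$text element. A slightly longer but perhaps easier to read equivalent is: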
df$keywords = lapply(df$text, function(x) {keywords[sapply(keywords, function(y){grepl(y,x)})]})
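The same idea can be sketched with a named helper, which may make the two loops easier to follow. This is only a sketch under a few assumptions: find_keywords is a hypothetical name (not from any package), vapply is swapped in for sapply so the result type is checked, and fixed = TRUE is added on the assumption that the keywords are meant as literal text rather than regular expressions.

```r
df <- data.frame(text = c("This string is not that long",
                          "This string is a bit longer but still not that long",
                          "This one just helps with the example"),
                 stringsAsFactors = FALSE)
keywords <- c("not that long", "This string", "example", "helps")

# find_keywords() is a hypothetical helper name used for illustration only
find_keywords <- function(x, keywords) {
  # One TRUE/FALSE per keyword; fixed = TRUE treats each keyword literally
  hit <- vapply(keywords, grepl, logical(1), x = x, fixed = TRUE)
  keywords[hit]
}

# The outer loop runs over the rows; the result is stored as a list column
df$keywords <- lapply(df$text, find_keywords, keywords = keywords)

df$keywords[[3]]      # keywords found in the third row
lengths(df$keywords)  # number of keywords found in each row
```

Because the new column is a list, each row can hold a character vector of a different length; df$keywords[[i]] returns the vector for row i.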
Hope this helps!
Answer 1: (score: 2)
We can use str_extract_all from stringr to extract the matches:
library(stringr)
df$keywords <- str_extract_all(df$text, paste(keywords, collapse = "|"))
df
# text keywords
#1 This string is not that long This string, not that long
#2 This string is a bit longer but still not that long This string, not that long
#3 This one just helps with the example helps, example
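One difference from testing each keyword separately is worth keeping in mind: a single alternation pattern returns non-overlapping matches in the order they appear in the text. The sketch below uses a made-up pair of overlapping keywords (the vector overlapping is an assumption for illustration, not part of the question) to show how the two approaches can differ.

```r
library(stringr)

text <- "This string is not that long"
# Hypothetical keywords that overlap on the word "string"
overlapping <- c("This string", "string is")

# Alternation: "This string" consumes the shared word, so "string is"
# can no longer match in the remaining text
by_regex <- str_extract_all(text, paste(overlapping, collapse = "|"))[[1]]

# Per-keyword test: each keyword is checked on its own against the full text
by_grepl <- overlapping[vapply(overlapping, grepl, logical(1),
                               x = text, fixed = TRUE)]

by_regex
by_grepl
```

With the keywords from the question there is no overlap, so both approaches find the same keywords and differ only in the order in which they are listed.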
Or in a chain:
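library(dplyr)
# Build the pattern first: inside mutate(), `keywords` would refer to an
# existing df$keywords column (if there is one) rather than the vector
pattern <- paste(keywords, collapse = "|")
df %>%
  mutate(keywords = str_extract_all(text, pattern))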