Question

If I have a data frame with the following column:

df$text <- c("This string is not that long", "This string is a bit longer but still not that long", "This one just helps with the example")

and a vector of strings like this:

keywords <- c("not that long", "This string", "example", "helps")

I'm trying to add a column to my data frame that contains the list of keywords present in the text of each row:

df$keywords:

1 c("This string","not that long")    
2 c("This string","not that long")    
3 c("helps","example")

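However, I'm not sure how to 1) extract the matching words from the text column, and 2) then list the matched words in each row of the new column.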

Answer 1

Maybe something like this:

df = data.frame(text=c("This string is not that long", "This string is a bit longer but still not that long", "This one just helps with the example"))
keywords <- c("not that long", "This string", "example", "helps")

df$keywords = lapply(df$text, function(x) {keywords[sapply(keywords,grepl,x)]})

Output:

                                                 text                   keywords
1                        This string is not that long not that long, This string
2 This string is a bit longer but still not that long not that long, This string
3                This one just helps with the example             example, helps

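The outer lapply loops over df$text, and the inner sapply checks, for each element of keywords, whether it occurs in the current element of df$text. The resulting logical vector is then used to subset keywords. A slightly longer but perhaps easier to read equivalent is: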

df$keywords = lapply(df$text, function(x) {keywords[sapply(keywords, function(y){grepl(y,x)})]})
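To make the two steps concrete, here is a minimal sketch that runs the inner check on a single row and then wraps the same logic in a helper. The name `find_keywords` is hypothetical, and `fixed = TRUE` is an addition that is not in the answer above: it makes grepl treat each keyword as a literal string rather than a regular expression, which only matters if a keyword contains characters such as `.` or `(`.

```r
# Sample data from the question.
text <- c("This string is not that long",
          "This string is a bit longer but still not that long",
          "This one just helps with the example")
keywords <- c("not that long", "This string", "example", "helps")

# Inner step for the first row: one TRUE/FALSE per keyword.
hits <- sapply(keywords, grepl, text[1])

# Indexing keywords with that logical vector keeps only the matches.
keywords[hits]

# Hypothetical helper (not from the answer): same idea, but with
# fixed = TRUE so keywords are matched literally, not as regexes.
find_keywords <- function(x, keywords) {
  keywords[vapply(keywords, grepl, logical(1), x, fixed = TRUE)]
}

# Outer step: apply the helper to every row; the result is a list
# that can be assigned to a data frame column, e.g. df$keywords.
result <- lapply(text, find_keywords, keywords = keywords)
```

With this approach the matches come back in the order of the keywords vector, not in the order they appear in the text.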

Hope this helps!

Answer 2

We can use str_extract_all from stringr to extract the matches:

library(stringr)
df$keywords <- str_extract_all(df$text, paste(keywords, collapse = "|"))
df
#                                                text                   keywords
#1                        This string is not that long This string, not that long
#2 This string is a bit longer but still not that long This string, not that long
#3                This one just helps with the example             helps, example
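If stringr is not available, the same alternation idea can be sketched in base R with gregexpr and regmatches. This is an alternative shown for illustration, not what the answer itself uses, and the variable names `pattern` and `matches` are assumptions.

```r
# Sample data from the question.
text <- c("This string is not that long",
          "This string is a bit longer but still not that long",
          "This one just helps with the example")
keywords <- c("not that long", "This string", "example", "helps")

# Join the keywords into a single regex: "not that long|This string|..."
pattern <- paste(keywords, collapse = "|")

# gregexpr finds every match position in each string;
# regmatches turns those positions into the matched substrings.
matches <- regmatches(text, gregexpr(pattern, text))

matches[[1]]  # "This string"   "not that long"
matches[[3]]  # "helps"         "example"
```

Two things differ from the grepl approach. Matches are returned in the order they appear in the text rather than in the order of the keywords vector, and because the regex engine finds non-overlapping matches, a keyword that overlaps an earlier match in the same string is not reported, whereas grepl tests each keyword independently.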

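Or in a pipe chain (this assumes df does not already have a keywords column; otherwise mutate would pick up that column instead of the keywords vector):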

library(dplyr)
df %>%
   mutate(keywords = str_extract_all(text, paste(keywords, collapse = "|")))

Add a column listing the keywords (strings) found in a text column
