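I'm still learning R, so please bear with me.
I have a dataset of Google Play Store apps (master_tib). Each row is one Play Store app. The column titled Description contains text describing what the app does.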
master_tib
App    Description
App1   Reduce your depression and anxiety
App2   Help your depression
App3   This app helps with Anxiety
App4   Dog walker app 3000
I also have a df of tags (master_tags) containing important words that I have predefined. It has a single column titled Tag, with one tag per row.
master_tag
Tag
Depression
Anxiety
Stress
Mood
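My goal is to tag the apps in the master_tib df with the tags from the master_tags df, based on whether each tag appears in the description. The matching tags would then be written to a new column. The end result would be a master_tib df that looks like this: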
App    Description                          Tag
App1   Reduce your depression and anxiety   depression, anxiety
App2   Help your depression                 depression
App3   This app helps with anxiety          anxiety
App4   Dog walker app 3000                  FALSE
Here is what I have done so far, using a combination of str_detect and mapply:
# define function to use in mapply
detect_tag <- function(description, tag) {
  if (str_detect(description, tag, FALSE)) {
    return(tag)
  } else {
    return(FALSE)
  }
}
index <- mapply(FUN = detect_tag, description = master_tib$description, master_tags$tag)
master_tib[index, ]
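Unfortunately, only the first tag comes through.
App    Description                          Tag
App1   Reduce your depression and anxiety   depression
Instead of the desired:
App    Description                          Tag
App1   Reduce your depression and anxiety   depression, anxiety
I haven't gotten as far as writing the results to a new column yet. I would love to hear anyone's insights or ideas, and I apologize for my poor R skills.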
Answer 0: (score: 4)
You can combine the words from master_tag into a single pattern with str_c, then use str_extract_all to get all the words that match that pattern.
library(stringr)
master_tib$Tag <- sapply(
  str_extract_all(tolower(master_tib$Description),
                  str_c('\\b', tolower(master_tag$Tag), '\\b', collapse = "|")),
  function(x) toString(unique(x)))
master_tib$Tag
#[1] "depression, anxiety" "depression" "anxiety" ""
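To make the one-liner above easier to follow, here is a small sketch that builds the same pattern step by step and keeps the intermediate values. The vectors `tags` and `descriptions` are stand-ins for `master_tag$Tag` and `master_tib$Description`, introduced only for illustration.

```r
library(stringr)

# Stand-in vectors (assumed for illustration, not from the original data)
tags <- c("Depression", "Anxiety", "Stress", "Mood")
descriptions <- c("Reduce your depression and anxiety", "Dog walker app 3000")

# Wrap each lower-cased tag in word boundaries, then join with "|" (regex OR)
pattern <- str_c("\\b", tolower(tags), "\\b", collapse = "|")
cat(pattern, "\n")

# One character vector of matches per description; empty when nothing matches
matches <- str_extract_all(tolower(descriptions), pattern)

# Collapse each vector of matches into a single comma-separated string
result <- sapply(matches, function(x) toString(unique(x)))
print(result)
```

Descriptions with no matching tag end up as an empty string, because `toString()` of an empty vector is `""`.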
Data
master_tag <- structure(list(Tag = c("Depression", "Anxiety", "Stress", "Mood"
)), class = "data.frame", row.names = c(NA, -4L))
master_tib <- structure(list(App = c("App1 ", "App2 ", "App3 ", "App4 "
), Description = c("Reduce your depression and anxiety", "Help your depression",
"This app helps with Anxiety", "Dog walker app 3000")), row.names = c(NA,
-4L), class = "data.frame")
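As for why the `mapply` attempt in the question returned only one tag: `mapply` walks its vector arguments in parallel, so it compares the first description with the first tag, the second with the second, and so on, rather than testing every tag against every description. A minimal base R sketch of the difference, using hypothetical lower-cased vectors `descs` and `tags`:

```r
# Hypothetical lower-cased inputs (assumed for illustration)
descs <- c("reduce your depression and anxiety", "help your depression",
           "this app helps with anxiety", "dog walker app 3000")
tags <- c("depression", "anxiety", "stress", "mood")

# mapply pairs elements by position: descs[1] with tags[1], descs[2] with tags[2], ...
pairwise <- unname(mapply(function(d, tag) grepl(tag, d, fixed = TRUE), descs, tags))
print(pairwise)

# sapply over the tags tests every tag against every description,
# giving a logical matrix with one row per description and one column per tag
all_pairs <- sapply(tags, grepl, descs, fixed = TRUE)
print(all_pairs)
```

With these inputs only the first pair happens to match, which is consistent with the single row the question reports.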
Answer 1: (score: 2)
Using several packages from the tidyverse (dplyr, stringr, tidyr) and the data shown in @Ronak Shah's answer.
First convert the tags into a pattern:
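library(dplyr)
library(stringr)
library(tidyr)
# master_tag is the tag data frame from @Ronak Shah's answer
pattern <- master_tag$Tag %>%
  tolower() %>%
  str_c(collapse = "|")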
Then find all the matches and build the desired output:
master_tib %>%
  mutate(Tag = str_extract_all(tolower(Description), pattern)) %>%
  unnest(Tag, keep_empty = TRUE) %>%
  group_by(App, Description) %>%
  summarise(Tag = str_c(Tag, collapse = ", "))
This produces
# A tibble: 4 x 3
# Groups: App [4]
App Description Tag
<chr> <chr> <chr>
1 App1 Reduce your depression and anxiety depression, anxiety
2 App2 Help your depression depression
3 App3 This app helps with Anxiety anxiety
4 App4 Dog walker app 3000 NA
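One difference from the previous answer is worth noting: the pattern here has no `\\b` word boundaries, so a tag can also match inside a longer word. A small sketch with a made-up sentence shows the effect; whether that matters depends on the tag list.

```r
library(stringr)

text <- "a moody dog with a stressful day"  # made-up example text

# Without boundaries, "mood" and "stress" match inside longer words
loose <- str_extract_all(text, "mood|stress")[[1]]
print(loose)

# With \\b on both sides, only whole-word occurrences match
strict <- str_extract_all(text, "\\bmood\\b|\\bstress\\b")[[1]]
print(strict)
```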
Answer 2: (score: 1)
Similar to @RonakShah's answer, but in base R:
apply(
  sapply(master_tag$Tag, grepl, master_tib$Description, ignore.case = TRUE),
  1, function(a) paste(master_tag$Tag[a], collapse = ","))
# [1] "Depression,Anxiety" "Depression" "Anxiety"
# [4] ""
(This leaves out the lower-casing and the "comma-space" separator; both are easy to add if needed.)
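For completeness, one way those two refinements might be added in base R is sketched below. The vectors `tags` and `descriptions` are stand-ins for `master_tag$Tag` and `master_tib$Description`, assumed here so the snippet runs on its own.

```r
# Stand-in vectors (assumed for illustration)
tags <- c("Depression", "Anxiety", "Stress", "Mood")
descriptions <- c("Reduce your depression and anxiety", "Help your depression",
                  "This app helps with Anxiety", "Dog walker app 3000")

# Logical matrix: one row per description, one column per tag
hits <- sapply(tags, grepl, descriptions, ignore.case = TRUE)

# For each row, lower-case the matching tags and join them with ", "
tag_col <- apply(hits, 1, function(a) paste(tolower(tags[a]), collapse = ", "))
print(tag_col)
```

Rows without any matching tag come out as an empty string, since pasting an empty vector with `collapse` yields `""`.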