Question

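I'm just learning R, so please bear with me.

I have a dataset of Google Play Store apps (master_tib). Each row is one Play Store app. A column titled Description contains text describing what the app does.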

master_tib

App     Description
App1    Reduce your depression and anxiety
App2    Help your depression 
App3    This app helps with Anxiety 
App4    Dog walker app 3000

I also have a df of tags (master_tags) containing important words that I have predefined. It has a single column, titled Tag, with one tag per row.

master_tag

Tag
Depression
Anxiety
Stress
Mood

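My goal is to tag the apps in the master_tib df with the tags from the master_tags df, based on which tags appear in each description, and then print the matched tags in a new column. The end result would be a master_tib df that looks like this: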

App     Description                            Tag
App1    Reduce your depression and anxiety     depression, anxiety
App2    Help your depression                   depression
App3    This app helps with anxiety            anxiety
App4    Dog walker app 3000                    FALSE

Here is what I have done so far, using a combination of str_detect and mapply:

# define function to use in mapply

detect_tag <- function(description, tag){ 
  if(str_detect(description, tag, FALSE)) {
    return (tag)
  } else { 
    return (FALSE)
  }
}

index <-  mapply(FUN = detect_tag, description = master_tib$description, master_tags$tag)

master_tib[index,]

Unfortunately, only the first tag gets through.

App     Description                            Tag
App1    Reduce your depression and anxiety     depression

Instead of the desired:

App     Description                            Tag
App1    Reduce your depression and anxiety     depression, anxiety

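I haven't yet gotten as far as printing the results into a new column. I'd love to hear anyone's insights or ideas, and apologies for my poor R skills.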

Answer 1

You can combine the words from master_tag into a single pattern with str_c, then use str_extract_all to get all the words that match that pattern.

library(stringr)
master_tib$Tag <- sapply(str_extract_all(tolower(master_tib$Description), 
              str_c('\\b', tolower(master_tag$Tag), '\\b', collapse = "|")), 
              function(x) toString(unique(x)))
master_tib$Tag
#[1] "depression, anxiety" "depression"          "anxiety"             ""
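
To make the pattern step easier to follow, here is a minimal sketch that builds the pattern on its own and shows what the word boundaries do. The tags vector and the name tag_pattern are assumptions for illustration, standing in for master_tag$Tag.

```r
library(stringr)

# Hypothetical stand-in for master_tag$Tag
tags <- c("Depression", "Anxiety", "Stress", "Mood")

# Wrap each lowercased tag in word boundaries and join the alternatives with "|"
tag_pattern <- str_c("\\b", tolower(tags), "\\b", collapse = "|")
tag_pattern
#[1] "\\bdepression\\b|\\banxiety\\b|\\bstress\\b|\\bmood\\b"

# The word boundaries keep "mood" from matching inside "moody"
str_extract_all("a moody app for your mood", tag_pattern)[[1]]
#[1] "mood"
```

Without the `\\b` markers, the same pattern would also pick up "mood" inside "moody", which may or may not be what you want for tagging.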

Data

master_tag <- structure(list(Tag = c("Depression", "Anxiety", "Stress", "Mood"
)), class = "data.frame", row.names = c(NA, -4L))

master_tib <- structure(list(App = c("App1  ", "App2  ", "App3  ", "App4  "
), Description = c("Reduce your depression and anxiety", "Help your depression", 
"This app helps with Anxiety", "Dog walker app 3000")), row.names = c(NA, 
-4L), class = "data.frame")

Answer 2

Using several packages from the tidyverse (dplyr, stringr, tidyr) and the data shown in @Ronak Shah's answer. First, turn the tags into a pattern:

library(dplyr)
library(stringr)
library(tidyr)
pattern <- master_tag$Tag %>%
  tolower() %>%
  str_c(collapse = "|")

Then find all matches and create the desired output:

master_tib %>%
  mutate(Tag = str_extract_all(tolower(Description), pattern)) %>%
  unnest(Tag, keep_empty = TRUE) %>%
  group_by(App, Description) %>% 
  summarise(Tag = str_c(Tag, collapse=", "))

This yields

# A tibble: 4 x 3
# Groups:   App [4]
  App   Description                        Tag                
  <chr> <chr>                              <chr>              
1 App1  Reduce your depression and anxiety depression, anxiety
2 App2  Help your depression               depression         
3 App3  This app helps with Anxiety        anxiety            
4 App4  Dog walker app 3000                NA
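
The keep_empty = TRUE argument is what keeps apps with no matching tag in the result. As a small sketch (the descriptions vector here is made up for illustration), str_extract_all returns a list with a zero-length character vector for such rows; unnest() drops those rows by default, and with keep_empty = TRUE they are kept with NA instead:

```r
library(stringr)

# Hypothetical descriptions: one containing two tags, one containing none
descriptions <- c("Reduce your depression and anxiety", "Dog walker app 3000")

matches <- str_extract_all(tolower(descriptions), "depression|anxiety|stress|mood")

# One list element per description; the second one is character(0)
lengths(matches)
#[1] 2 0
```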

Answer 3

Similar to @RonakShah's answer, but in base R:

apply(
  sapply(master_tag$Tag, grepl, master_tib$Description, ignore.case = TRUE),
  1, function(a) paste(master_tag$Tag[a], collapse = ","))
# [1] "Depression,Anxiety" "Depression"         "Anxiety"           
# [4] ""

(This version does not lowercase the tags or use the "comma-space" separator; both niceties are easy to add if needed.)
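
For completeness, one possible way to add the lowercasing and the comma-space separator in base R is sketched below. The intermediate name hits is an assumption for illustration, and the data is repeated only so the snippet runs on its own.

```r
# Same data as in the first answer, repeated so the snippet is self-contained
master_tag <- data.frame(Tag = c("Depression", "Anxiety", "Stress", "Mood"),
                         stringsAsFactors = FALSE)
master_tib <- data.frame(
  App = c("App1", "App2", "App3", "App4"),
  Description = c("Reduce your depression and anxiety", "Help your depression",
                  "This app helps with Anxiety", "Dog walker app 3000"),
  stringsAsFactors = FALSE)

# Logical matrix: one row per description, one column per tag
hits <- sapply(master_tag$Tag, grepl, master_tib$Description, ignore.case = TRUE)

# Lowercase the matched tags and join them with ", " (toString uses ", ")
master_tib$Tag <- apply(hits, 1, function(a) toString(tolower(master_tag$Tag[a])))
master_tib$Tag
# Elements: "depression, anxiety", "depression", "anxiety", ""
```

Note that grepl here matches substrings, so unlike the `\\b` version it would also flag "mood" inside "moody".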

Detect multiple strings from one df in a string in another df and, if detected, return the detected strings
