Detect multiple strings from one df within the strings of another df and, if detected, return the detected strings

Time: 2020-05-31 23:04:35

Tags: r string tags mapply

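I am still learning R, so please bear with me.

I have a dataset of Google Play Store apps (master_tib). Each row is one Play Store app. A column titled Description contains text describing what the app does.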

master_tib

App     Description
App1    Reduce your depression and anxiety
App2    Help your depression 
App3    This app helps with Anxiety 
App4    Dog walker app 3000 

I also have a df of tags (master_tags) containing important words that I have predefined. It has a single column titled Tag, with one tag per row.

master_tag

Tag
Depression
Anxiety
Stress
Mood

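My goal is to tag the apps in the master_tib df with the tags from the master_tags df, based on whether each tag appears in the description. The matching tags should then be written to a new column. The end result would be a master_tib df that looks like this: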

App     Description                            Tag
App1    Reduce your depression and anxiety     depression, anxiety
App2    Help your depression                   depression
App3    This app helps with anxiety            anxiety
App4    Dog walker app 3000                    FALSE

Here is what I have done so far, using a combination of str_detect and mapply:

# define function to use in mapply

detect_tag <- function(description, tag){ 
  if(str_detect(description, tag, FALSE)) {
    return (tag)
  } else { 
    return (FALSE)
  }
}

index <-  mapply(FUN = detect_tag, description = master_tib$description, master_tags$tag)

master_tib[index,]

Unfortunately, only the first tag is getting through.

App     Description                            Tag
App1    Reduce your depression and anxiety     depression

Instead of the desired:

App     Description                            Tag
App1    Reduce your depression and anxiety     depression, anxiety

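I have not yet gotten as far as writing the results to a new column. I would love to hear anyone's insights or ideas, and I apologize for my poor R skills.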

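3 answers:

Answer 0: (score: 4)

You can combine the words from master_tag into a single pattern with str_c, and then use str_extract_all to get all the words that match the pattern.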

library(stringr)
master_tib$Tag <- sapply(str_extract_all(tolower(master_tib$Description), 
              str_c('\\b', tolower(master_tag$Tag), '\\b', collapse = "|")), 
              function(x) toString(unique(x)))
master_tib$Tag
#[1] "depression, anxiety" "depression"          "anxiety"             "" 
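To make the one-liner above easier to follow, here is a minimal sketch that splits it into its intermediate steps. The variable names (tags, descriptions, pattern, matches, tag_column) are illustrative and not part of the original answer, and only two of the descriptions are used.

```r
library(stringr)

# Illustrative inputs taken from the question's example data
tags <- c("Depression", "Anxiety", "Stress", "Mood")
descriptions <- c("Reduce your depression and anxiety", "Dog walker app 3000")

# Step 1: wrap each lower-cased tag in word boundaries and join with "|"
pattern <- str_c("\\b", tolower(tags), "\\b", collapse = "|")
cat(pattern, "\n")

# Step 2: extract every match; the result is a list with one character
# vector per description (an empty vector when nothing matches)
matches <- str_extract_all(tolower(descriptions), pattern)
str(matches)

# Step 3: collapse each vector into one comma-separated string
tag_column <- sapply(matches, function(x) toString(unique(x)))
print(tag_column)
```

Because toString() on an empty vector gives an empty string, descriptions without any tag end up as "" rather than FALSE or NA.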

Data

master_tag <- structure(list(Tag = c("Depression", "Anxiety", "Stress", "Mood"
)), class = "data.frame", row.names = c(NA, -4L))

master_tib <- structure(list(App = c("App1  ", "App2  ", "App3  ", "App4  "
), Description = c("Reduce your depression and anxiety", "Help your depression", 
"This app helps with Anxiety", "Dog walker app 3000")), row.names = c(NA, 
-4L), class = "data.frame")

Answer 1: (score: 2)

Using several packages from the tidyverse (dplyr, stringr, tidyr) and the data shown in @Ronak Shah's answer. First turn the tags into a pattern:

library(dplyr)
library(stringr)
library(tidyr)
pattern <- master_tag$Tag %>%
  tolower() %>%
  str_c(collapse="|")

Then find all matches and create the desired output:

master_tib %>%
  mutate(Tag = str_extract_all(tolower(Description), pattern)) %>%
  unnest(Tag, keep_empty = TRUE) %>%
  group_by(App, Description) %>% 
  summarise(Tag = str_c(Tag, collapse=", "))

This yields

# A tibble: 4 x 3
# Groups:   App [4]
  App   Description                        Tag                
  <chr> <chr>                              <chr>              
1 App1  Reduce your depression and anxiety depression, anxiety
2 App2  Help your depression               depression         
3 App3  This app helps with Anxiety        anxiety            
4 App4  Dog walker app 3000                NA 
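One difference from the first answer is that this pattern has no word boundaries, so a tag can also match inside a longer word. The sketch below illustrates this with a hypothetical description ("Moody weather tracker") that is not in the original data; the variable names are illustrative as well.

```r
library(stringr)

tags <- c("Depression", "Anxiety", "Stress", "Mood")

# Hypothetical description, not from the original data
description <- tolower("Moody weather tracker")

# Pattern as built in this answer (no word boundaries)
plain_pattern <- str_c(tolower(tags), collapse = "|")

# Pattern with \\b around each tag, as in the first answer
bounded_pattern <- str_c("\\b", tolower(tags), "\\b", collapse = "|")

plain_hits   <- str_extract_all(description, plain_pattern)[[1]]
bounded_hits <- str_extract_all(description, bounded_pattern)[[1]]

print(plain_hits)    # "mood" is found inside "moody"
print(bounded_hits)  # empty: "moody" is not the whole word "mood"
```

Which behavior is preferable depends on the data: partial matches catch variants such as "stressful", at the cost of occasional false positives.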

Answer 2: (score: 1)

Similar to @RonakShah's answer, but in base R:

apply(
  sapply(master_tag$Tag, grepl, master_tib$Description, ignore.case = TRUE),
  1, function(a) paste(master_tag$Tag[a], collapse = ","))
# [1] "Depression,Anxiety" "Depression"         "Anxiety"           
# [4] ""                  

(This leaves out the lower-casing and the "comma-space" niceties, which are easy to add if needed.)
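For completeness, a minimal sketch of how the lower-casing and the comma-space separator might be added to the base R approach, with the result stored in a new Tag column. The intermediate name hits is illustrative, and the data frames are rebuilt here only so the snippet runs on its own.

```r
# Example data, mirroring the data shown in the first answer
master_tag <- data.frame(
  Tag = c("Depression", "Anxiety", "Stress", "Mood"),
  stringsAsFactors = FALSE
)
master_tib <- data.frame(
  App = c("App1", "App2", "App3", "App4"),
  Description = c("Reduce your depression and anxiety", "Help your depression",
                  "This app helps with Anxiety", "Dog walker app 3000"),
  stringsAsFactors = FALSE
)

# Logical matrix: one row per description, one column per tag
hits <- sapply(master_tag$Tag, grepl, master_tib$Description, ignore.case = TRUE)

# For each row, keep the matching tags, lower-case them, join with ", "
master_tib$Tag <- apply(hits, 1, function(a) {
  paste(tolower(master_tag$Tag[a]), collapse = ", ")
})

print(master_tib$Tag)
```

Note that this relies on sapply returning a matrix, which holds when master_tib has more than one row; with a single row the result would be a plain vector and apply would fail.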
