Create categories based on keywords in R

Date: 2018-11-08 17:50:34

Tags: r categories lookup lookup-tables

I have a data frame with two columns: the first holds keywords and the second holds the associated categories.

keywords <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")

lookup_table <- data.frame(keywords, categories)

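Whenever I get a new label, I want to check whether it contains a keyword with a corresponding category and, if so, attach that category.

So for the example below, the first row of the new column would get the value "category1":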

new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")

Any help is much appreciated!

3 answers:

Answer 0: (score: 1)

Here you can simply use str_extract to pull out the relevant text and then join against the reference table.

keywords <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")

lookup_table <- data.frame(keywords, categories)
new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")

library(data.table)
library(tidyverse)
ref_tbl <- 
# data.table(
# Following AntoniosK's suggestion, use the dplyr-style function instead.
tibble(
    keywords = keywords
    ,categories = categories
)

# as.data.table(
# Following AntoniosK's suggestion, use the dplyr-style function instead.
as_tibble(
    new_labels
    ) %>% 
    mutate(ref_key = str_extract(new_labels
                                 # ,'keyword[:digit:]'
                                 ,(
                                   keywords %>% 
                                     str_flatten('|')
                                   # regular expression
                                 )
                                 )) %>% 
    left_join(
         ref_tbl
         ,by=c('ref_key'='keywords')
    )
#> # A tibble: 3 x 3
#>   value             ref_key  categories
#>   <chr>             <chr>    <chr>     
#> 1 keyword1 qefjhqek keyword1 category1 
#> 2 hfaef             <NA>     <NA>      
#> 3 fihiz             <NA>     <NA>

Created on 2018-11-10 by the reprex package (v0.2.1)
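As a side note on the pattern itself, here is a minimal base R sketch of what the collapsed keyword pattern looks like and where it can over-match. The label "keyword10 abc" and the whole-word requirement are assumptions added for illustration; they are not part of the question.

```r
keywords <- c("keyword1", "keyword2", "keyword3")

# Collapse the keywords into one alternation pattern,
# which is what str_flatten(keywords, "|") produces as well.
pattern <- paste(keywords, collapse = "|")
pattern
#> [1] "keyword1|keyword2|keyword3"

# Assumption: a keyword should only count when it appears as a whole word.
bounded <- paste0("\\b(", pattern, ")\\b")

# Hypothetical labels; the second one merely starts with "keyword1".
labels <- c("keyword1 qefjhqek", "keyword10 abc", "hfaef")

grepl(pattern, labels)
#> [1]  TRUE  TRUE FALSE

grepl(bounded, labels, perl = TRUE)
#> [1]  TRUE FALSE FALSE
```

If the keywords can contain regex metacharacters such as "." or "+", they would also need escaping before being pasted into a pattern.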


Following @AntoniosK's question, I compared data.table and tibble. The timings do suggest that tibble is faster than data.table here.

  1. tibble only: 2990 ms (1st)
  2. data.table with as.data.table: 3240 ms (2nd)
  3. data.table only: 3840 ms (3rd)

Answer 1: (score: 0)

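Here is a tidyverse solution that creates all combinations of new labels and keywords, spots the matches and extracts the keyword, and then joins the categories from the lookup table: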

keywords <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")
lookup_table <- data.frame(keywords, categories)

new_labels = c("keyword1 qefjhqek", "hfaef", "fihiz")

library(tidyverse)

expand.grid(new_labels=new_labels, keywords=lookup_table$keywords) %>%  # create all combinations
  mutate_all(as.character) %>%                                          # update to character (if needed)
  mutate(v = map2_lgl(new_labels, keywords, ~grepl(.y, .x))) %>%        # check if there's a match
  group_by(new_labels) %>%                                              # for each new label
  summarise(keywords = ifelse(sum(v) > 0, keywords[v==TRUE], NA)) %>%   # get the keyword if there is one
  left_join(lookup_table, by="keywords") %>%                            # join categories
  select(-keywords)                                                     # remove keywords

# # A tibble: 3 x 2
#   new_labels        categories
#   <chr>             <fct>     
# 1 fihiz             NA        
# 2 hfaef             NA        
# 3 keyword1 qefjhqek category1 
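Note that the summarise step above keeps only the first matching keyword per label, because ifelse() with a length-one test returns a single value. If a label may contain several keywords and you want all of their categories, something along these lines might work. This is a base R sketch under that assumption; the label "keyword2 and keyword3" is made up for illustration.

```r
lookup_table <- data.frame(
  keywords   = c("keyword1", "keyword2", "keyword3"),
  categories = c("category1", "category2", "category3"),
  stringsAsFactors = FALSE
)

# Hypothetical labels: the second one contains two keywords.
new_labels <- c("keyword1 qefjhqek", "keyword2 and keyword3", "hfaef")

# Logical matrix with one row per label and one column per keyword.
# fixed = TRUE treats each keyword as literal text rather than a regex.
hit <- sapply(lookup_table$keywords,
              function(k) grepl(k, new_labels, fixed = TRUE))

# For each label, collect every category whose keyword matched
# (an empty character vector when nothing matched).
all_categories <- lapply(seq_along(new_labels),
                         function(i) lookup_table$categories[hit[i, ]])
names(all_categories) <- new_labels
all_categories
```

With a single label, sapply() returns a plain vector instead of a matrix, so the hit[i, ] indexing would need adjusting in that case.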

Another version, inspired by @Jianxiang's answer, which uses str_extract to get the relevant keyword instead of creating all combinations:

data.frame(new_labels) %>%
  mutate(keywords = str_extract(new_labels, str_flatten(lookup_table$keywords, "|"))) %>%
  left_join(lookup_table, by="keywords") %>%        
  select(-keywords)

#           new_labels categories
# 1 keyword1 qefjhqek  category1
# 2             hfaef       <NA>
# 3             fihiz       <NA>

Answer 2: (score: 0)

library(tidyverse)

tibble(
    label = new_labels,
    keywords = unlist(
      map2(new_labels, paste(lookup_table$keywords, collapse = "|"), str_extract)
    )) %>% 
    left_join(lookup_table, by = "keywords") %>%
    select(- keywords)

# A tibble: 3 x 2
#   label             categories
#   <chr>             <fct>   
# 1 keyword1 qefjhqek category1
# 2 hfaef             <NA>    
# 3 fihiz             <NA>
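
For comparison, the same two steps (extract the first matching keyword, then look up its category) can be sketched without any packages. This is only a rough base R equivalent, assuming the data from the question and character (not factor) columns; the names hits, found and result are just illustrative.

```r
lookup_table <- data.frame(
  keywords   = c("keyword1", "keyword2", "keyword3"),
  categories = c("category1", "category2", "category3"),
  stringsAsFactors = FALSE
)
new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")

# One alternation pattern covering every keyword.
pattern <- paste(lookup_table$keywords, collapse = "|")

# regexpr() gives the position of the first match in each label, or -1 if none.
hits <- regexpr(pattern, new_labels)

# Start from NA and fill in the matched keyword only where a match exists;
# regmatches() returns values for the matching labels only.
found <- rep(NA_character_, length(new_labels))
found[hits > 0] <- regmatches(new_labels, hits)

# match() finds each keyword's row in the lookup table; NA stays NA.
result <- data.frame(
  new_labels = new_labels,
  categories = lookup_table$categories[match(found, lookup_table$keywords)],
  stringsAsFactors = FALSE
)
result
```

Using match() here plays the role of the left join: labels without a keyword simply end up with NA as their category.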