I have a data frame with two columns: the first holds keywords, the second holds the associated category.
keywords <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")
lookup_table <- data.frame(keywords, categories)
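Whenever I get a new label, I want to check whether it contains a keyword with a corresponding category and, if so, attach that category.
So for the example below, the first row of the new column would get the value "category1":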
new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")
Any help is much appreciated!
Answer 0 (score: 1)
Just use str_extract here to get the relevant text, then join with the reference table.
keywords <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")
lookup_table <- data.frame(keywords, categories)
new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")
library(tidyverse)

# Reference table. tibble() is used instead of data.table(),
# following AntoniosK's suggestion to prefer dplyr-style functions.
ref_tbl <- tibble(
  keywords   = keywords,
  categories = categories
)

# as_tibble() on a character vector yields a single column named `value`.
as_tibble(new_labels) %>%
  mutate(ref_key = str_extract(
    new_labels,
    # regular expression built from the keywords: "keyword1|keyword2|keyword3"
    # (a hard-coded alternative would be 'keyword[:digit:]')
    str_flatten(keywords, '|')
  )) %>%
  left_join(ref_tbl, by = c('ref_key' = 'keywords'))
#> # A tibble: 3 x 3
#> value ref_key categories
#> <chr> <chr> <chr>
#> 1 keyword1 qefjhqek keyword1 category1
#> 2 hfaef <NA> <NA>
#> 3 fihiz <NA> <NA>
Created on 2018-11-10 by the reprex package (v0.2.1)
Following @AntoniosK's question, I compared data.table and tibble. In practice there was a clear indication that tibble is the better choice than data.table here.
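To make the pattern-building step concrete, here is a minimal sketch of what str_flatten and str_extract return on the example data. It assumes only the stringr package is loaded; the variable name `pattern` is introduced here for illustration and does not appear in the answer.

```r
library(stringr)

keywords   <- c("keyword1", "keyword2", "keyword3")
new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")

# Collapse the keywords into a single alternation regex.
pattern <- str_flatten(keywords, "|")
pattern
#> [1] "keyword1|keyword2|keyword3"

# First match per label; NA where no keyword occurs in the label.
str_extract(new_labels, pattern)
#> [1] "keyword1" NA         NA
```

Note that the keywords are interpreted as regular expressions, so a keyword containing metacharacters such as "." or "(" would need escaping, and str_extract returns only the first match when a label contains several keywords.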
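Answer 1 (score: 0)
Here is a tidyverse solution that creates all combinations of new labels and keywords, checks each pair for a match and extracts the matching keyword, then joins the categories from the lookup table: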
keywords <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")
lookup_table <- data.frame(keywords, categories)
new_labels = c("keyword1 qefjhqek", "hfaef", "fihiz")
library(tidyverse)
expand.grid(new_labels = new_labels, keywords = lookup_table$keywords) %>%  # create all combinations
  mutate_all(as.character) %>%                                              # convert to character (if needed)
  mutate(v = map2_lgl(new_labels, keywords, ~grepl(.y, .x))) %>%            # check if there's a match
  group_by(new_labels) %>%                                                  # for each new label
  summarise(keywords = ifelse(sum(v) > 0, keywords[v == TRUE], NA)) %>%     # get the (first) keyword if there is one
  left_join(lookup_table, by = "keywords") %>%                              # join categories
  select(-keywords)                                                         # remove keywords
# # A tibble: 3 x 2
# new_labels categories
# <chr> <fct>
# 1 fihiz NA
# 2 hfaef NA
# 3 keyword1 qefjhqek category1
Another version, inspired by @Jianxiang's answer, uses str_extract to get the relevant keyword instead of creating all combinations:
data.frame(new_labels) %>%
  mutate(keywords = str_extract(new_labels, str_flatten(lookup_table$keywords, "|"))) %>%
  left_join(lookup_table, by = "keywords") %>%
  select(-keywords)
# new_labels categories
# 1 keyword1 qefjhqek category1
# 2 hfaef <NA>
# 3 fihiz <NA>
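For comparison, the same extract-then-look-up idea can be sketched in base R without any packages. This is not taken from any of the answers: it swaps str_extract for regexpr/regmatches and the join for match(), and the names `pattern`, `m`, `found` and `result` are hypothetical. It also assumes the lookup table stores plain character columns (stringsAsFactors = FALSE).

```r
keywords   <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")
lookup_table <- data.frame(keywords, categories, stringsAsFactors = FALSE)
new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")

# Alternation regex built from the keywords.
pattern <- paste(lookup_table$keywords, collapse = "|")

# regexpr() gives the position of the first match per label (-1 if none).
m <- regexpr(pattern, new_labels)

# Fill in the matched keyword where there is one, NA otherwise.
found <- rep(NA_character_, length(new_labels))
found[m > 0] <- regmatches(new_labels, m)

# match() returns the lookup row for each keyword (NA when no keyword was found).
result <- data.frame(
  new_labels,
  categories = lookup_table$categories[match(found, lookup_table$keywords)],
  stringsAsFactors = FALSE
)
result
```

Because match() returns NA for labels without a keyword, those rows end up with an NA category, which mirrors the behaviour of the left_join in the versions above.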
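Answer 2 (score: 0)
library(tidyverse)
tibble(
  label = new_labels,
  # extract the first matching keyword from each label
  keywords = unlist(
    map2(new_labels, paste(lookup_table$keywords, collapse = "|"), str_extract)
  )
) %>%
  left_join(lookup_table, by = "keywords") %>%  # join categories on the extracted keyword
  select(-keywords)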
# A tibble: 3 x 2
# label categories
# <chr> <fct>
# 1 keyword1 qefjhqek category1
# 2 hfaef <NA>
# 3 fihiz <NA>