Partial string matching and returning a value in R

Time: 2017-01-04 19:50:24

Tags: r string match return-value product

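I have several purchasing databases, and I need to run a list of "keywords" that I built against them to identify certain products. When a product matches, I want to tag it with a surgical category.

Here is an example.

Purchasing database (in reality I have more than 2,000,000 rows to go through):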

d<-data.frame(prod_desc=c("BANDELETTE TVTO-OBTRYX HALO", "BANDELETTE MINI ARC PRECISES", "BANDELETTE D'ANALYSE POUR GLYCEMIE", "DIACH. BANDELETTE STER 19MM X 72MM","SLING MALE SYSTEM","DIACHILON","AIGUILLE","GANT","LABEL","CRAYON"),label=1:10)

Keyword list and return value (the real list is much longer):

kw<-data.frame(kw=c("bandelette","tvt","bande transvaginale","sling system","argus"),category="ss_bandelette")

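I want to find the products whose prod_desc string contains one of my kw keywords and, when there is a match, add a column to the d data frame that returns the category associated with that keyword in the kw data frame.

Right now I can get the desired result with the following code: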

d$match <- ifelse(grepl(paste(kw$kw, collapse = "|"), d$prod_desc, ignore.case = TRUE), "SS_Bandelette", "-")

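But this approach does not scale, because the category label is hard-coded and I have about 350 keywords mapping to 30 different categories. What code could I use to automatically return the category in the d data frame whenever one of my keywords is triggered?

Thank you very much for your help.

Phil

2 answers:

Answer 0 (score: 1)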


Answer 1 (score: 0)

# Create the data frame as per the original question
d <- data.frame(prod_desc = c("BANDELETTE TVTO-OBTRYX HALO", "BANDELETTE MINI ARC PRECISES", "BANDELETTE D'ANALYSE POUR GLYCEMIE", "DIACH. BANDELETTE STER 19MM X 72MM", "SLING MALE SYSTEM", "DIACHILON", "AIGUILLE", "GANT", "LABEL", "CRAYON"), label = 1:10)
# Create the keywords as per the original question
kw <- data.frame(kw = c("bandelette", "tvt", "bande transvaginale", "sling system", "argus"), category = "ss_bandelette")
# Assume you want to match on word boundaries. If not, "BANDELETTE TVTO-OBTRYX HALO" would also match "tvt", for instance.
kw$kw <- paste0("\\b", kw$kw, "\\b")

# Logical matrix: one row per product, one column per keyword pattern
x <- sapply(kw$kw, function(p) grepl(p, d$prod_desc, ignore.case = TRUE))
# Index of the first keyword that matches each row (NA if none),
# so a row that hits several keywords still yields a single value
first_hit <- apply(x, 1, function(i) if (any(i)) which(i)[[1]] else NA_integer_)
d$Match <- kw$category[first_hit]
d
#                             prod_desc label         Match
# 1         BANDELETTE TVTO-OBTRYX HALO     1 ss_bandelette
# 2        BANDELETTE MINI ARC PRECISES     2 ss_bandelette
# 3  BANDELETTE D'ANALYSE POUR GLYCEMIE     3 ss_bandelette
# 4  DIACH. BANDELETTE STER 19MM X 72MM     4 ss_bandelette
# 5                   SLING MALE SYSTEM     5          <NA>
# 6                           DIACHILON     6          <NA>
# 7                            AIGUILLE     7          <NA>
# 8                                GANT     8          <NA>
# 9                               LABEL     9          <NA>
# 10                             CRAYON    10          <NA>
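
To extend this to the roughly 350 keywords and 30 categories mentioned in the question, one option is to collapse each category's keywords into a single alternation pattern and call grepl once per category instead of once per keyword. The following is only a minimal sketch under some assumptions: the "office" category and its keywords are hypothetical and added purely for illustration, the data frame is shortened, and when a product could match several categories the first category in alphabetical order wins.

```r
# Shortened sample of the purchasing data from the question
d <- data.frame(
  prod_desc = c("BANDELETTE TVTO-OBTRYX HALO", "SLING MALE SYSTEM",
                "DIACHILON", "LABEL", "CRAYON"),
  stringsAsFactors = FALSE
)

# Keyword table; the "office" rows are hypothetical, for illustration only
kw <- data.frame(
  kw = c("bandelette", "tvt", "sling system", "crayon", "label"),
  category = c("ss_bandelette", "ss_bandelette", "ss_bandelette",
               "office", "office"),
  stringsAsFactors = FALSE
)

# Build one pattern per category: \b(kw1|kw2|...)\b
patterns <- vapply(
  split(kw$kw, kw$category),
  function(k) paste0("\\b(", paste(k, collapse = "|"), ")\\b"),
  character(1)
)

# Tag each product with the first category whose pattern matches
d$category <- NA_character_
for (cat_name in names(patterns)) {
  hit <- is.na(d$category) &
    grepl(patterns[[cat_name]], d$prod_desc, ignore.case = TRUE)
  d$category[hit] <- cat_name
}
d
```

With 30 categories this scans prod_desc 30 times rather than 350, which should matter on 2,000,000 rows. Products that match no keyword keep NA in the category column.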