Question

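I have a character vector containing the names of drugs that are already registered, and another one containing the names of new drugs. I want to know whether each new drug is similar to an existing one.

For example, if supercure is a drug that can be produced by firm1 or firm2, and supercure firm1 1000mg and supercure firm2 500mg are already registered, then supercure firm1 500mg should be matched to both of them.

agrep allows this kind of approximate matching in R, and sapply lets me apply it to every drug in the new list: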

new <- c("supercure firm1 500mg", "randomcure firm2 1000mg", "unknowncure firm2 100mg")
registered <- c("supercure firm1 1000mg", "supercure firm2 500mg", "randomcure firm1 1000mg")
# For each new drug, find the indices of approximately matching registered drugs
res <- unlist(sapply(new, agrep, x = registered))
res

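As expected, supercure gets two matches, randomcure gets one, and unknowncure gets none (which is what I want). However, sapply seems to change the names so that there are no duplicates: supercure firm1 500mg becomes supercure firm1 500mg1 and supercure firm1 500mg2: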

supercure firm1 500mg1   supercure firm1 500mg2 randomcure firm2 1000mg 
                    1                       2                       3

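This is a problem because it prevents me from selecting the matched drugs from the new list:

new[new %in% names(res)] only catches randomcure (because the supercure names have been changed).

I can think of ways to work around this with some fairly inelegant text processing, but is there a smarter way to get the list of new drugs for which a match was found?

The desired output is: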

supercure firm1 500mg   supercure firm1 500mg randomcure firm2 1000mg 
                    1                       2                       3

Answer 1

You can try converting it to a data frame, applying stack, and then using setNames to turn the result into a named vector, i.e.

d1 <- unique(stack(data.frame(Filter(length, sapply(new,agrep,x=registered)))))
#  values                     ind
#1      1   supercure.firm1.500mg
#2      2   supercure.firm1.500mg
#3      3 randomcure.firm2.1000mg

setNames(d1$values, d1$ind)
#  supercure.firm1.500mg   supercure.firm1.500mg randomcure.firm2.1000mg 
#                      1                       2                       3
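
Note that the names above contain dots instead of spaces, because data.frame sanitizes column names. As a possible variation (a sketch, not part of the original answer), stack can also be applied to the named list directly, which should leave the names untouched. The names matches, d2 and res2 are made up for this illustration, and simplify = FALSE is added so that sapply always returns a list.

```r
new <- c("supercure firm1 500mg", "randomcure firm2 1000mg", "unknowncure firm2 100mg")
registered <- c("supercure firm1 1000mg", "supercure firm2 500mg", "randomcure firm1 1000mg")

# Named list of match indices, one element per new drug;
# Filter(length, ...) drops the drugs that have no match.
matches <- Filter(length, sapply(new, agrep, x = registered, simplify = FALSE))

# stack() on a named list gives a data frame with columns `values`
# (the match indices) and `ind` (a factor holding the list names).
d2 <- stack(matches)

# Convert the factor to character so the original names are used as-is.
res2 <- setNames(d2$values, as.character(d2$ind))
res2
```

With the names preserved, new[new %in% names(res2)] should select both the supercure and the randomcure entries.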

Answer 2

sapply does not change the names; unlist does. This gives the desired output:

x <- sapply(new,agrep,x=registered)
setNames(unlist(x),rep(names(x),lengths(x)))
#  supercure firm1 500mg   supercure firm1 500mg randomcure firm2 1000mg 
#                      1                       2                       3
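
If the end goal is only the list of new drugs that have at least one match, it may be simpler to look at the lengths of the list elements before unlisting. The following is a minimal sketch under two assumptions of mine: simplify = FALSE is passed so that sapply returns a list even when every drug happens to have the same number of matches, and the name matched is invented for the example.

```r
new <- c("supercure firm1 500mg", "randomcure firm2 1000mg", "unknowncure firm2 100mg")
registered <- c("supercure firm1 1000mg", "supercure firm2 500mg", "randomcure firm1 1000mg")

# Always a named list: one integer vector of match indices per new drug.
x <- sapply(new, agrep, x = registered, simplify = FALSE)

# New drugs with at least one approximate match among the registered ones.
matched <- new[lengths(x) > 0]
matched

# Same named vector as above, built without letting unlist() rename anything:
# each name is repeated once per match.
res <- setNames(unlist(x, use.names = FALSE), rep(names(x), lengths(x)))
res
```

Here new[new %in% names(res)] and matched should agree, since the duplicated names are kept unchanged.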

How can I keep sapply from changing duplicated names?