R: Pattern matching of partial strings between two objects (case-insensitive)

Asked: 2015-09-06 20:22:36

Tags: r grep pattern-matching string-matching grepl

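I have two data frames with a field I need to match on (in this case, tree species), but the values were entered haphazardly, in different formats and naming schemes.

d1 is the species key, with "common name" and "species group" fields; d2 is the raw tree data, containing many trees (so species repeat):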

d1 <- data.frame(common.name = c("balsam fir", "white pine", "sugar maple", "red oak", 
"Lilac tree", "Chokecherry", "beark oak", "bigtooth aspen"), 
species.group = c("a","b","c","d","d","a","b","c"))

> d1
     common.name species.group
1     balsam fir             a
2     white pine             b
3    sugar maple             c
4        red oak             d
5     Lilac tree             d
6    Chokecherry             a
7      beark oak             b
8 bigtooth aspen             c



d2 <- data.frame(generic.name = c("Fir","Pine", "Oak", "Maple", "Elm", "Cherry", "Aspen", "Pine", "Pine", "Oak", "Fir", "Oak", "Pine", "Oak", "Oak", "Oak"))

> d2
   generic.name
1        Fir
2        Pine
3        Oak
4        Maple
5        Elm
6        Cherry
7        Aspen
8        Pine
9        Pine
10       Oak
11       Fir
12       Oak
13       Pine
14       Oak
15       Oak
16       Oak

I need to pattern-match d1$common.name against d2$generic.name in order to:

Group the tree species in d2$generic.name into the corresponding d1$species.group.
key.names <- tolower(unlist(strsplit(as.character(d1$common.name), " ", fixed=TRUE)))
key.groups <- unique(tolower(d1$species.group))

# d2$species.group <- function(matching)?

> d2
       generic.name    species.group
1           Fir             a
2          Pine             b
3           Oak             c
4         Maple             d
5           Elm             d
6        Cherry             c
7         Aspen             d
8          Pine             b
9          Pine             b
10          Oak             c
11          Fir             a
12          Oak             c
13         Pine             b
14          Oak             c
15          Oak             c
16          Oak             c

Sorry for the incomplete code example, but I have no idea what that function would look like. Thanks!

1 Answer:

Answer 0: (score: 1)

Something like this:

d1 <- c("balsam fir", "white pine", "sugar maple", "red oak", "lilac tree")
d2 <- c("Fir","Pine", "Oak", "Maple", "Japanese Lilac")

words <- tolower(unlist(strsplit(d2, " ", fixed=TRUE)))
words
# [1] "fir"      "pine"     "oak"      "maple"    "japanese" "lilac"   

# For one name x, return every entry of `words` that occurs in x (literal substring match)
get.category <- function(x) names(which(sapply(words, grepl, x, fixed=TRUE)))
data.frame(name=d1,
           category=sapply(d1, get.category),
           row.names=seq_along(d1))
#          name category
# 1  balsam fir      fir
# 2  white pine     pine
# 3 sugar maple    maple
# 4     red oak      oak
# 5  lilac tree    lilac
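
The same idea can be pointed the other way to fill in d2$species.group from the question's data frames. The following is only a sketch under a few assumptions: `lookup.group` is a hypothetical helper name (not part of the answer above), both sides are lower-cased with tolower() to make the match case-insensitive, the first matching row of d1 wins when several match, and unmatched names get NA.

```r
# Key table and raw data from the question. stringsAsFactors = FALSE keeps the
# columns as character vectors on R versions before 4.0.
d1 <- data.frame(
  common.name = c("balsam fir", "white pine", "sugar maple", "red oak",
                  "Lilac tree", "Chokecherry", "beark oak", "bigtooth aspen"),
  species.group = c("a", "b", "c", "d", "d", "a", "b", "c"),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  generic.name = c("Fir", "Pine", "Oak", "Maple", "Elm", "Cherry", "Aspen",
                   "Pine", "Pine", "Oak", "Fir", "Oak", "Pine", "Oak", "Oak", "Oak"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: species.group of the first common.name that contains
# `generic` as a literal substring (case-insensitive), or NA if none does.
lookup.group <- function(generic, key = d1) {
  hit <- grepl(tolower(generic), tolower(key$common.name), fixed = TRUE)
  if (any(hit)) key$species.group[which(hit)[1]] else NA_character_
}

# vapply guarantees exactly one character value per row of d2.
d2$species.group <- vapply(d2$generic.name, lookup.group, character(1),
                           USE.NAMES = FALSE)

unique(d2)  # one row per distinct generic name, with its assigned group
```

With this key, "Oak" matches both "red oak" and "beark oak", so the first-match rule assigns it group d, and "Elm" has no counterpart in d1 and comes back as NA. If ambiguous names should be flagged rather than silently resolved, the helper could return all matching groups instead of only the first.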