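I have two data frames with a field I need to match on (tree species in this case), but the values were written haphazardly, in different formats and naming schemes.
d1 is the species key, with "common.name" and "species.group" fields; d2 is the raw tree data with many trees (so species repeat):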
d1 <- data.frame(common.name = c("balsam fir", "white pine", "sugar maple", "red oak",
"Lilac tree", "Chokecherry", "beark oak", "bigtooth aspen"),
species.group = c("a","b","c","d","d","a","b","c"))
> d1
common.name species.group
1 balsam fir a
2 white pine b
3 sugar maple c
4 red oak d
5 Lilac tree d
6 Chokecherry a
7 beark oak b
8 bigtooth aspen c
d2 <- data.frame(generic.name = c("Fir","Pine", "Oak", "Maple", "Elm", "Cherry", "Aspen", "Pine", "Pine", "Oak", "Fir", "Oak", "Pine", "Oak", "Oak", "Oak"))
> d2
generic.name
1 Fir
2 Pine
3 Oak
4 Maple
5 Elm
6 Cherry
7 Aspen
8 Pine
9 Pine
10 Oak
11 Fir
12 Oak
13 Pine
14 Oak
15 Oak
16 Oak
I need to pattern-match d1$common.name against d2$generic.name,
so that each tree in d2$generic.name is assigned the corresponding d1$species.group.
key.names <- tolower(unlist(strsplit(as.character(d1$common.name), " ", fixed=TRUE)))
key.groups <- unique(tolower(d1$species.group))
d2$species.group <- function(matching)?
> d2
generic.name species.group
1 Fir a
2 Pine b
3 Oak c
4 Maple d
5 Elm d
6 Cherry c
7 Aspen d
8 Pine b
9 Pine b
10 Oak c
11 Fir a
12 Oak c
13 Pine b
14 Oak c
15 Oak c
16 Oak c
Sorry for the incomplete code example, but I have no idea what that function would look like. Thanks!
Answer 0 (score: 1)
Something like this:
# Names to classify (d1) and the labels to look for (d2)
d1 <- c("balsam fir", "white pine", "sugar maple", "red oak", "lilac tree")
d2 <- c("Fir", "Pine", "Oak", "Maple", "Japanese Lilac")
# Split the labels into individual lowercase words
words <- tolower(unlist(strsplit(d2, " ", fixed=TRUE)))
words
# [1] "fir" "pine" "oak" "maple" "japanese" "lilac"
# Return every word that occurs as a literal substring of x
get.category <- function(x) names(which(sapply(words, grepl, x, fixed=TRUE)))
data.frame(name=d1,
           category=sapply(d1, get.category),
           row.names=seq_along(d1))
# name category
# 1 balsam fir fir
# 2 white pine pine
# 3 sugar maple maple
# 4 red oak oak
# 5 lilac tree lilac
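To get from there to the species.group column asked for in the question, the same substring idea can be run in the other direction: look each d2$generic.name up inside d1$common.name and take the group of the matching row. The following is only a minimal sketch under some assumptions: lookup.group is a hypothetical helper name (not part of base R), ties are resolved by taking the first matching row of d1, and names with no match get NA.

```r
# Data from the question (stringsAsFactors = FALSE keeps the names as character)
d1 <- data.frame(
  common.name = c("balsam fir", "white pine", "sugar maple", "red oak",
                  "Lilac tree", "Chokecherry", "beark oak", "bigtooth aspen"),
  species.group = c("a", "b", "c", "d", "d", "a", "b", "c"),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  generic.name = c("Fir", "Pine", "Oak", "Maple", "Elm", "Cherry", "Aspen", "Pine",
                   "Pine", "Oak", "Fir", "Oak", "Pine", "Oak", "Oak", "Oak"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the group of the first row of `key` whose
# common name contains `generic` (case-insensitive, literal substring),
# or NA when nothing matches.
lookup.group <- function(generic, key) {
  hit <- grepl(tolower(generic), tolower(key$common.name), fixed = TRUE)
  if (any(hit)) key$species.group[which(hit)[1]] else NA_character_
}

# One group per row of d2
d2$species.group <- vapply(d2$generic.name, lookup.group, character(1),
                           key = d1, USE.NAMES = FALSE)
head(d2, 7)
#   generic.name species.group
# 1          Fir             a
# 2         Pine             b
# 3          Oak             d
# 4        Maple             c
# 5          Elm          <NA>
# 6       Cherry             a
# 7        Aspen             c
```

Two things to watch with this approach: "Oak" matches both "red oak" (group d) and "beark oak" (group b), so the first-match rule is an arbitrary choice you may want to replace, and "Elm" has no entry in d1, so it comes back as NA. The groups follow d1 as defined above, so they will not necessarily agree with the illustrative table in the question.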