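Short version: How do I replace values in a data frame with strings found in another data frame?
Longer version: I'm a biologist working with a lot of bees. I have a data set with thousands of bees. Each row has a unique bee ID# along with all the relevant information about that specimen (capture date, GPS location, etc.). The species information for each bee has not been entered yet, because the bees take a long time to identify. When IDing, I end up with hundreds of bees that are all the same species, and I enter these into a separate data frame. I'm trying to write code that updates the original data file with the species information (family, genus, species, sex, etc.) as I ID the bees. Currently the species information in the original data file is blank, which R reads as NA. I want R to find each unique bee ID# and fill in the species information, but I can't figure out how to replace the NA values with a string (e.g. "Andrenidae").
Here is a simple example of what I'm trying to do: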
rawData<-data.frame(beeID=c(1:20),family=rep(NA,20))
speciesInfo<-data.frame(beeID=seq(1,20,3),family=rep("Andrenidae",7))
rawData[rawData$beeID == 4,"family"] <- speciesInfo[speciesInfo$beeID == 4,"family"]
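So the replacement lands where I want it, but with a number rather than the family name (a string). What I'd eventually like to do is write a small loop to add in all the species information, e.g.: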
for (i in speciesInfo$beeID){
rawData[rawData$beeID == i,"family"] <- speciesInfo[speciesInfo$beeID == i,"family"]
}
Thanks in advance for any advice!
Cheers,
Zach
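EDIT:
I just noticed that the first two methods below add a new column each time, which causes problems if I need to add species information more than once (which is usually the case). For example: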
rawData<-data.frame(beeID=c(1:20),family=rep(NA,20))
Andrenidae<-data.frame(beeID=seq(1,20,3),family=rep("Andrenidae",7))
Halictidae<-data.frame(beeID=seq(1,20,3)+1,family=rep("Halictidae",7))
# using join
library(plyr)
rawData <- join(rawData, Andrenidae, by = "beeID", type = "left")
rawData <- join(rawData, Halictidae, by = "beeID", type = "left")
# using merge
rawData <- merge(x=rawData,y=Andrenidae,by='beeID',all.x=T,all.y=F)
rawData <- merge(x=rawData,y=Halictidae,by='beeID',all.x=T,all.y=F)
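Is there a way to collapse the columns so that I have one unified data frame? Or a way to update rawData instead of adding a new column each time? Thanks in advance!
Answer 0: (score: 4)
Here is a function that I think will work for you. It uses match to find the positions in rawData of the IDs in the annotation data frame, then replaces the values at those positions.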
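replaceID <- function(to, from, mergeBy, values) {
  # positions in `to` of the keys listed in `from`
  x <- match(from[, mergeBy], to[, mergeBy])
  # overwrite only those rows; as.character() avoids storing factor codes
  to[, values][x] <- as.character(from[, values])
  return(to)
}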
> rawData <- replaceID(rawData,Halictidae,"beeID","family")
> rawData
beeID family
1 1 <NA>
2 2 Halictidae
3 3 <NA>
4 4 <NA>
5 5 Halictidae
6 6 <NA>
7 7 <NA>
8 8 Halictidae
9 9 <NA>
10 10 <NA>
11 11 Halictidae
12 12 <NA>
13 13 <NA>
14 14 Halictidae
15 15 <NA>
16 16 <NA>
17 17 Halictidae
18 18 <NA>
19 19 <NA>
20 20 Halictidae
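The as.character() call is what addresses the "number instead of a string" symptom from the question. Before R 4.0.0, data.frame() converted strings to factors by default, and assigning a factor into an NA column stores its integer level code. Below is a minimal sketch of that behaviour and of the question's loop with the conversion added; stringsAsFactors = TRUE is set explicitly here so the example behaves the same on any R version, and the names codeStored and fixedData are made up for illustration.

```r
# Reference table built with factors, as data.frame() did by default before R 4.0.0.
speciesInfo <- data.frame(beeID = seq(1, 20, 3),
                          family = rep("Andrenidae", 7),
                          stringsAsFactors = TRUE)

# Assigning the factor directly stores its integer level code.
rawData <- data.frame(beeID = 1:20, family = rep(NA, 20))
rawData[rawData$beeID == 4, "family"] <-
  speciesInfo[speciesInfo$beeID == 4, "family"]
codeStored <- rawData$family[4]   # a number, not "Andrenidae"

# Converting to character first keeps the label, so the loop from the question works.
fixedData <- data.frame(beeID = 1:20, family = rep(NA, 20))
for (i in speciesInfo$beeID) {
  fixedData[fixedData$beeID == i, "family"] <-
    as.character(speciesInfo[speciesInfo$beeID == i, "family"])
}
```

Creating the reference data frames with stringsAsFactors = FALSE (the default from R 4.0.0 onward) should avoid the problem without the explicit conversion.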
Answer 1: (score: 3)
Another option is to use join from plyr (see ?join):
library(plyr)
#Adding family ahead of time was unnecessary so I'll remove it alongside the join.
join(rawData, speciesInfo, by = "beeID", type = "left")[,-2]
beeID family
1 1 Andrenidae
2 2 <NA>
3 3 <NA>
4 4 Andrenidae
5 5 <NA>
6 6 <NA>
7 7 Andrenidae
8 8 <NA>
9 9 <NA>
10 10 Andrenidae
11 11 <NA>
12 12 <NA>
13 13 Andrenidae
14 14 <NA>
15 15 <NA>
16 16 Andrenidae
17 17 <NA>
18 18 <NA>
19 19 Andrenidae
20 20 <NA>
# If you anticipate adding new species over time,
# simply rbind those into a single reference data.frame to merge with your rawData.
# Like so:
library(plyr)
rawData <- join(rawData, rbind(Andrenidae, Halictidae), by = "beeID", type = "left")
# To keep your code clean, you could do this step ahead of time
species_list <- rbind(Andrenidae, Halictidae)
rawData <- join(rawData, species_list, by = "beeID", type = "left")
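One caveat when stacking reference tables with rbind: if the same beeID appears in more than one table, a left join returns one row per match, so rawData gains rows. The sketch below checks for that before joining. It is only an illustration: the Corrections table, the re-identified bee 4, and the "keep the most recently added row" rule are all assumptions, and base merge stands in for plyr's join so the snippet has no dependencies.

```r
rawData <- data.frame(beeID = 1:20)
Andrenidae <- data.frame(beeID = seq(1, 20, 3), family = "Andrenidae",
                         stringsAsFactors = FALSE)
# Hypothetical correction: bee 4 was re-identified later.
Corrections <- data.frame(beeID = 4, family = "Halictidae",
                          stringsAsFactors = FALSE)

species_list <- rbind(Andrenidae, Corrections)

# IDs that occur more than once in the combined reference table
dupIDs <- unique(species_list$beeID[duplicated(species_list$beeID)])

# Assumed rule: keep the most recently added row for each beeID.
species_list <- species_list[!duplicated(species_list$beeID, fromLast = TRUE), ]

# Left join: every row of rawData is kept, with family filled in where known.
updated <- merge(rawData, species_list, by = "beeID", all.x = TRUE)
```

With the duplicates resolved first, updated has the same number of rows as rawData.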
Answer 2: (score: 2)
You can use the merge function, for example:
rawData <- data.frame(beeID=c(1:20),family=rep(NA,20))
speciesInfo <- data.frame(beeID=seq(1,20,3),
family=c(rep('Halictidae',4), rep("Andrenidae",3)))
merged <- merge(x=rawData,y=speciesInfo,by='beeID',all.x=T,all.y=F)
merged$family.x <- NULL # remove the family.x column
names(merged) <- c('beeID','family') # rename the columns
N.B.
There is no need to initialize rawData with a family column; the merge function adds it automatically, for example:
rawData <- data.frame(beeID=c(1:20))
speciesInfo <- data.frame(beeID=seq(1,20,3),
family=c(rep('Halictidae',4), rep("Andrenidae",3)))
merged <- merge(x=rawData,y=speciesInfo,by='beeID',all.x=T,all.y=F)
> merged
beeID family
1 1 Halictidae
2 2 <NA>
3 3 <NA>
4 4 Halictidae
5 5 <NA>
6 6 <NA>
7 7 Halictidae
8 8 <NA>
9 9 <NA>
10 10 Halictidae
11 11 <NA>
12 12 <NA>
13 13 Andrenidae
14 14 <NA>
15 15 <NA>
16 16 Andrenidae
17 17 <NA>
18 18 <NA>
19 19 Andrenidae
20 20 <NA>
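When rawData already has a family column, as in the EDIT to the question, each merge leaves two family columns behind. One way to collapse them is to take the newly supplied value where there is one and fall back to the existing value otherwise. This is a base-R sketch under those assumptions; the helper name updateFamily and the ".old"/".new" suffixes are made up for illustration.

```r
updateFamily <- function(raw, ref) {
  merged <- merge(raw, ref, by = "beeID", all.x = TRUE,
                  suffixes = c(".old", ".new"))
  # Prefer the newly supplied family; otherwise keep the existing value.
  merged$family <- ifelse(is.na(merged$family.new),
                          as.character(merged$family.old),
                          as.character(merged$family.new))
  merged$family.old <- NULL
  merged$family.new <- NULL
  merged
}

rawData <- data.frame(beeID = 1:20, family = NA_character_,
                      stringsAsFactors = FALSE)
Andrenidae <- data.frame(beeID = seq(1, 20, 3), family = "Andrenidae",
                         stringsAsFactors = FALSE)
Halictidae <- data.frame(beeID = seq(1, 20, 3) + 1, family = "Halictidae",
                         stringsAsFactors = FALSE)

# Each call updates the single family column instead of adding another one.
rawData <- updateFamily(rawData, Andrenidae)
rawData <- updateFamily(rawData, Halictidae)
```

Because the result always has exactly one family column, the function can be called again whenever a new batch of identifications is ready.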
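Answer 3: (score: 2)
A data.table solution, for memory and time efficiency.
I have used stringsAsFactors = F when creating the data and rbindlist
(a super-fast version of do.call(rbind, list) / rbind) to combine the species tables. I have also added an other_stuff column to the rawData
object and removed family. Creating the data -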
rawData <- data.frame(beeID = c(1:20), other_stuff = sample(letters, 20), stringsAsFactors = F)
Andrenidae <- data.frame(beeID = seq(1, 20, 3), family = rep("Andrenidae", 7), stringsAsFactors = F)
Halictidae <- data.frame(beeID = seq(1, 20 , 3)+ 1, family = rep("Halictidae", 7), stringsAsFactors = F)
library(data.table)
# convert to data.table
rawDT <- as.data.table(rawData)
# combine the list of Species-specific data.frames into a large data.table
speciesInfo <- rbindlist(list(Andrenidae, Halictidae))
# set the keys, to allow efficient use of data.table and its merging
# abilities. The keys are the same for both
setkeyv(rawDT, 'beeID')
setkeyv(speciesInfo, 'beeID')
# merge by key
speciesInfo[rawDT, nomatch = NA]
## beeID family other_stuff
## 1: 1 Andrenidae s
## 2: 2 Halictidae x
## 3: 3 NA i
## 4: 4 Andrenidae e
## 5: 5 Halictidae v
## 6: 6 NA q
## 7: 7 Andrenidae w
## 8: 8 Halictidae c
## 9: 9 NA u
## 10: 10 Andrenidae z
## 11: 11 Halictidae y
## 12: 12 NA a
## 13: 13 Andrenidae l
## 14: 14 Halictidae r
## 15: 15 NA h
## 16: 16 Andrenidae o
## 17: 17 Halictidae n
## 18: 18 NA g
## 19: 19 Andrenidae p
## 20: 20 Halictidae m
or
rawDT[speciesInfo]
## beeID other_stuff family
## 1: 1 s Andrenidae
## 2: 2 x Halictidae
## 3: 4 e Andrenidae
## 4: 5 v Halictidae
## 5: 7 w Andrenidae
## 6: 8 c Halictidae
## 7: 10 z Andrenidae
## 8: 11 y Halictidae
## 9: 13 l Andrenidae
## 10: 14 r Halictidae
## 11: 16 o Andrenidae
## 12: 17 n Halictidae
## 13: 19 p Andrenidae
## 14: 20 m Halictidae
Use whichever form returns the data you are interested in.
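The joins above return a new table each time. If the goal is to update rawData itself as batches of identifications arrive, data.table also supports an update join with := inside the join. The sketch below assumes a data.table version that supports the on= argument (1.9.6 or later); the other_stuff values are placeholders rather than the random sample used above.

```r
library(data.table)

rawDT <- data.table(beeID = 1:20, other_stuff = letters[1:20])
Andrenidae <- data.table(beeID = as.integer(seq(1, 20, 3)),
                         family = "Andrenidae")
Halictidae <- data.table(beeID = as.integer(seq(1, 20, 3) + 1),
                         family = "Halictidae")

# Update join: for rows of rawDT that match on beeID, set family from the
# joined table (i.family). The column is created on first use, and rows
# without a match are left as NA.
rawDT[Andrenidae, on = "beeID", family := i.family]
rawDT[Halictidae, on = "beeID", family := i.family]
```

Each call modifies rawDT by reference, so no extra columns accumulate and no reassignment is needed.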