I have a data frame of categorical variables that looks like this (but longer).
taxTest <- structure(list(Kingdom = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Bacteria", class = "factor"),
Phylum = structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("Bacteroidetes",
"Proteobacteria"), class = "factor"), Class = structure(c(2L,
1L, 1L, 1L, 1L), .Label = c("Bacteroidia", "Gammaproteobacteria"
), class = "factor"), Order = structure(c(2L, 1L, 1L, 1L,
1L), .Label = c("Bacteroidales", "Enterobacteriales"), class = "factor"),
Family = structure(c(2L, 1L, 3L, 1L, 3L), .Label = c("Bacteroidaceae",
"Enterobacteriaceae", "Prevotellaceae"), class = "factor"),
Genus = structure(c(2L, 1L, 3L, 1L, 3L), .Label = c("Bacteroides",
"Escherichia/Shigella", "Prevotella"), class = "factor"),
Genus.y = structure(c(NA, 1L, 2L, 1L, 2L), .Label = c("Bacteroides",
"Prevotella"), class = "factor"), Species = structure(c(1L,
4L, 2L, 5L, 3L), .Label = c("albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris",
"copri", "disiens", "dorei", "dorei/vulgatus"), class = "factor")), .Names = c("Kingdom",
"Phylum", "Class", "Order", "Family", "Genus", "Genus.y", "Species"
), row.names = c("tax1", "tax2", "tax3", "tax4", "tax5"), class = "data.frame")
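I want to derive a short taxon name from this data. The function I actually run is a bit more complicated than this one (it has to handle NA values at several of these taxonomic levels), but it fails in the same way.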
library(dplyr)
tag_taxon <- function(tvdf){
species <- tvdf %>% dplyr::select(Species) %>% unlist
genus2 <- tvdf %>% dplyr::select(Genus, Genus.y) %>% unlist
genus <- genus2 %>% na.omit %>% .[1]
#genus <- tvdf %>% dplyr::select(Genus) %>% unlist
out <- paste(genus, species)
out }
If I run this function on each row of the table separately, I get the answer I expect: a genus and species name.
for(i in 1:5){
print(taxTest %>% .[i,] %>% tag_taxon)
}
[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"
[1] "Bacteroides dorei"
[1] "Prevotella copri"
[1] "Bacteroides dorei/vulgatus"
[1] "Prevotella disiens"
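I feel like I should be able to use dplyr to apply this function to each row of the data frame. Unfortunately, that gives a counterintuitive result.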
taxTest %>% rowwise %>% tag_taxon
'Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris'
'Escherichia/Shigella dorei'
'Escherichia/Shigella copri'
'Escherichia/Shigella dorei/vulgatus'
'Escherichia/Shigella disiens'
I thought the apply function might also work here, but it fails outright with a cryptic error message.
apply(taxTest, 1, tag_taxon)
Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "character"
Traceback:
1. apply(taxTest, 1, tag_taxon)
2. FUN(newX[, i], ...)
3. tvdf %>% dplyr::select(Species) %>% unlist   # at line 4 of file
4. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
5. eval(quote(`_fseq`(`_lhs`)), env, env)
6. eval(quote(`_fseq`(`_lhs`)), env, env)
7. `_fseq`(`_lhs`)
8. freduce(value, `_function_list`)
9. function_list[[i]](value)
10. dplyr::select(., Species)
11. select.default(., Species)
12. select_(.data, .dots = compat_as_lazy_dots(...))
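Any ideas about what is going on here? I can work around this entirely with a for loop, but I would rather use dplyr if I can.
Thanks!
Edit: One more thing! I forgot to mention in my original post that if I uncomment the line #genus <- tvdf %>% dplyr::select(Genus) %>% unlist (that is, if I don't try to append the species information to the genus information), the dplyr version gives the expected results.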
Answer 0 (score: 1)
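paste is vectorized, so there is no need for a separate function that operates row by row. The code below requires Genus and Genus.y to be character rather than factor, so I converted them before running the code.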
taxTest[,c("Genus","Genus.y")] = lapply(taxTest[,c("Genus","Genus.y")] , as.character)
taxTest %>%
mutate(tag = gsub("NA ", "", paste(Genus, ifelse(Genus.y==Genus, NA, Genus.y), Species)))
The gsub call removes each NA together with the space that follows it. Here are the contents of the tag column:
  tag
1 Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris
2 Bacteroides dorei
3 Prevotella copri
4 Bacteroides dorei/vulgatus
5 Prevotella disiens
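If the goal is the same as in the question's function (take the first non-missing value of Genus and Genus.y, then append Species), dplyr::coalesce may be a more direct fit than pasting and then stripping "NA ". The following is only a sketch under assumptions: the small data frame tax is a made-up stand-in for taxTest, its third row (Genus missing, Genus.y present) is hypothetical, and all columns are already character.

```r
library(dplyr)

# Hypothetical toy data shaped like the Genus / Genus.y / Species columns.
# Columns are character, not factor.
tax <- data.frame(
  Genus   = c("Escherichia/Shigella", "Bacteroides", NA),
  Genus.y = c(NA, "Bacteroides", "Prevotella"),
  Species = c("coli", "dorei", "copri"),
  stringsAsFactors = FALSE
)

# coalesce() picks, element by element, the first value that is not NA,
# so the whole column is handled in one vectorized call.
tagged <- tax %>%
  mutate(tag = paste(coalesce(Genus, Genus.y), Species))

tagged$tag
```

Note that this differs from the paste/gsub version when Genus and Genus.y are both present but disagree: coalesce keeps only Genus, whereas the code above would paste both.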
To see what the original code is doing, we can add a few cat statements to tag_taxon.
tag_taxon <- function(tvdf){
species <- tvdf %>% dplyr::select(Species) %>% unlist
genus2 <- tvdf %>% dplyr::select(Genus, Genus.y) %>% unlist
cat("genus2 = ", genus2,"\n")
genus <- genus2 %>% na.omit %>% .[1]
cat("genus = ", genus,"\n")
#genus <- tvdf %>% dplyr::select(Genus) %>% unlist
out <- paste(genus, species)
out }
for(i in 1:5){
print(taxTest %>% .[i,] %>% tag_taxon)
}
genus2 =  Escherichia/Shigella NA
genus =  Escherichia/Shigella
[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"
genus2 =  Bacteroides Bacteroides
genus =  Bacteroides
[1] "Bacteroides dorei"
genus2 =  Prevotella Prevotella
genus =  Prevotella
[1] "Prevotella copri"
genus2 =  Bacteroides Bacteroides
genus =  Bacteroides
[1] "Bacteroides dorei/vulgatus"
genus2 =  Prevotella Prevotella
genus =  Prevotella
[1] "Prevotella disiens"
OK, the for loop is doing what we expect. Now for dplyr::rowwise:
taxTest %>% rowwise %>% tag_taxon
genus2 =  Escherichia/Shigella Bacteroides Prevotella Bacteroides Prevotella NA Bacteroides Prevotella Bacteroides Prevotella
genus =  Escherichia/Shigella
[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"
[2] "Escherichia/Shigella dorei"
[3] "Escherichia/Shigella copri"
[4] "Escherichia/Shigella dorei/vulgatus"
[5] "Escherichia/Shigella disiens"
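So with rowwise, dplyr hands back genus2 as a single vector in which all the values of Genus and Genus.y are strung together, five from each column, with the NA from Genus.y still among them. genus then keeps only the first value, and paste recycles it across all five species. This may have something to do with the way dplyr performs non-standard evaluation, but I'm not positive.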
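One plausible reading, offered as my understanding of dplyr rather than something checked against its source: rowwise() only changes how verbs such as mutate() and summarise() evaluate their expressions. The object it returns still contains every row, and select() does not split it up, so a function that is simply piped after rowwise() sees the whole table. A minimal sketch of that behaviour, using a made-up three-row stand-in for taxTest (tax, with a hypothetical third row and character columns):

```r
library(dplyr)

# Hypothetical toy data shaped like the Genus / Genus.y / Species columns.
tax <- data.frame(
  Genus   = c("Escherichia/Shigella", "Bacteroides", NA),
  Genus.y = c(NA, "Bacteroides", "Prevotella"),
  Species = c("coli", "dorei", "copri"),
  stringsAsFactors = FALSE
)

# A plain function piped after rowwise() still receives all rows at once.
rows_seen <- tax %>% rowwise() %>% nrow()

# Inside mutate(), by contrast, each column name refers to the current
# row's single value, so the per-row logic works as intended.
tagged <- tax %>%
  rowwise() %>%
  mutate(tag = paste(na.omit(c(Genus, Genus.y))[1], Species)) %>%
  ungroup()

rows_seen
tagged$tag
```

In other words, the per-row logic has to live inside a dplyr verb for rowwise() to have any effect on it.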
If you want to use your own function, it behaves as expected with by_row from the purrrlyr package:
library(purrrlyr)
taxTest %>% by_row(tag_taxon)
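For completeness, base R can do the same job. The error message from apply in the question says select_ was applied to an object of class "character": apply() turns the data frame into a matrix and passes each row to the function as a plain named character vector, which dplyr::select() cannot handle. The sketch below shows two workarounds under assumptions: tax is a made-up stand-in for taxTest with character columns, and tag_row and tag_df are hypothetical helper names, not functions from the question.

```r
# Hypothetical toy data shaped like the Genus / Genus.y / Species columns.
tax <- data.frame(
  Genus   = c("Escherichia/Shigella", "Bacteroides", NA),
  Genus.y = c(NA, "Bacteroides", "Prevotella"),
  Species = c("coli", "dorei", "copri"),
  stringsAsFactors = FALSE
)

# Option 1: a helper written for apply(), where x is one row as a
# named character vector, so plain vector indexing is used.
tag_row <- function(x) {
  genus <- na.omit(x[c("Genus", "Genus.y")])[1]
  paste(genus, x[["Species"]])
}
tags_apply <- unname(apply(tax, 1, tag_row))

# Option 2: keep a helper that expects a data frame and hand it
# one-row data frames, which is what the for loop in the question does.
tag_df <- function(d) {
  genus <- na.omit(unlist(d[c("Genus", "Genus.y")]))[1]
  paste(genus, d$Species)
}
tags_loop <- vapply(
  seq_len(nrow(tax)),
  function(i) tag_df(tax[i, ]),
  character(1)
)

tags_apply
tags_loop
```

Option 2 is closest to what by_row does conceptually, since the function still receives a one-row data frame each time.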