dplyr with the rowwise operator gives a different result than looping the function over each row

Asked: 2017-12-02 00:33:14

Tags: r dplyr apply

I have a data frame of taxonomic variables that looks like this (but much longer).

taxTest <- structure(list(Kingdom = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Bacteria", class = "factor"), 
Phylum = structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("Bacteroidetes", 
"Proteobacteria"), class = "factor"), Class = structure(c(2L, 
1L, 1L, 1L, 1L), .Label = c("Bacteroidia", "Gammaproteobacteria"
), class = "factor"), Order = structure(c(2L, 1L, 1L, 1L, 
1L), .Label = c("Bacteroidales", "Enterobacteriales"), class = "factor"), 
Family = structure(c(2L, 1L, 3L, 1L, 3L), .Label = c("Bacteroidaceae", 
"Enterobacteriaceae", "Prevotellaceae"), class = "factor"), 
Genus = structure(c(2L, 1L, 3L, 1L, 3L), .Label = c("Bacteroides", 
"Escherichia/Shigella", "Prevotella"), class = "factor"), 
Genus.y = structure(c(NA, 1L, 2L, 1L, 2L), .Label = c("Bacteroides", 
"Prevotella"), class = "factor"), Species = structure(c(1L, 
4L, 2L, 5L, 3L), .Label = c("albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris", 
"copri", "disiens", "dorei", "dorei/vulgatus"), class = "factor")), .Names = c("Kingdom", 
"Phylum", "Class", "Order", "Family", "Genus", "Genus.y", "Species"
), row.names = c("tax1", "tax2", "tax3", "tax4", "tax5"), class = "data.frame")


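I want to derive a short taxonomic name from this data. The function I actually run is somewhat more complicated than the one below (it has to handle NA values at several of these taxonomic levels), but this one fails in the same way.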

library(dplyr)

tag_taxon <- function(tvdf){
    species <- tvdf %>% dplyr::select(Species) %>% unlist

    genus2 <- tvdf %>% dplyr::select(Genus, Genus.y) %>% unlist
    genus <- genus2 %>% na.omit %>% .[1]

    #genus <- tvdf %>% dplyr::select(Genus) %>% unlist

    out <- paste(genus, species)

    out
}

If I run this function on each row of the table, I get the answer I expect: a genus and species name.

for(i in 1:5){
    print(taxTest %>% .[i,] %>% tag_taxon)
}
  

[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"
[1] "Bacteroides dorei"
[1] "Prevotella copri"
[1] "Bacteroides dorei/vulgatus"
[1] "Prevotella disiens"

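I feel like I should be able to use dplyr to apply this function to each row of the data frame. Unfortunately, that gives a counterintuitive result.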

 taxTest %>% rowwise %>% tag_taxon
  

'Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris'
'Escherichia/Shigella dorei'
'Escherichia/Shigella copri'
'Escherichia/Shigella dorei/vulgatus'
'Escherichia/Shigella disiens'

I thought the apply function might also work here, but it just fails outright with a cryptic error message.

 apply(taxTest, 1, tag_taxon)
  

Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "character"
Traceback:

1. apply(taxTest, 1, tag_taxon)
2. FUN(newX[, i], ...)
3. tvdf %>% dplyr::select(Species) %>% unlist   # at line 4 of file
4. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
5. eval(quote(`_fseq`(`_lhs`)), env, env)
6. eval(quote(`_fseq`(`_lhs`)), env, env)
7. `_fseq`(`_lhs`)
8. freduce(value, `_function_list`)
9. function_list[[i]](value)
10. dplyr::select(., Species)
11. select.default(., Species)
12. select_(.data, .dots = compat_as_lazy_dots(...))

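Any ideas about what is going on here? I can work around this entirely with a for loop, but I would rather use dplyr if I can.

Thanks!

Edit: One more thing! I forgot to mention in my original post that if I uncomment the #genus <- tvdf %>% dplyr::select(Genus) %>% unlist line (that is, if I take the genus from Genus alone instead of combining Genus and Genus.y), the dplyr version gives the expected result.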

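1 Answer:

Answer 0: (score: 1)

paste is vectorized, so there is no need for a separate function that works row by row. The code below requires Genus and Genus.y to be character rather than factor, so I convert them before running it.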

taxTest[,c("Genus","Genus.y")] = lapply(taxTest[,c("Genus","Genus.y")] , as.character)

taxTest %>% 
  mutate(tag = gsub("NA ", "", paste(Genus, ifelse(Genus.y==Genus, NA, Genus.y), Species)))

The gsub call removes each NA together with the space that follows it. Here are the contents of the tag column:

  tag
1 Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris
2                                                                                        Bacteroides dorei
3                                                                                         Prevotella copri
4                                                                               Bacteroides dorei/vulgatus
5                                                                                       Prevotella disiens
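If the goal is simply "the first non-missing genus, then the species", as in the question's tag_taxon, a possible variant is dplyr's coalesce(), which picks the first non-NA value element by element. This is a minimal sketch on a made-up three-row data frame (tax and its values are illustrative, not the question's taxTest), and it assumes the columns are already character:

```r
library(dplyr)

# Small stand-in for taxTest, with character (not factor) columns.
tax <- data.frame(
  Genus   = c("Escherichia/Shigella", NA, "Prevotella"),
  Genus.y = c(NA, "Bacteroides", "Prevotella"),
  Species = c("coli", "dorei", "copri"),
  stringsAsFactors = FALSE
)

# coalesce() returns, element by element, the first non-NA value,
# so no NA ever reaches paste() and no gsub() clean-up is needed.
tagged <- tax %>%
  mutate(tag = paste(coalesce(Genus, Genus.y), Species))

print(tagged$tag)
```

Unlike the ifelse version above, this never shows Genus.y when Genus is present, which matches what the question's function does.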

To see what the original code is doing, we can add a few cat statements to tag_taxon.

tag_taxon <- function(tvdf){
  species <- tvdf %>% dplyr::select(Species) %>% unlist

  genus2 <- tvdf %>% dplyr::select(Genus, Genus.y) %>% unlist

  cat("genus2 = ", genus2,"\n")

  genus <- genus2 %>% na.omit %>% .[1]

  cat("genus = ", genus,"\n")

  #genus <- tvdf %>% dplyr::select(Genus) %>% unlist

  out <- paste(genus, species)

  out }

for(i in 1:5){
  print(taxTest %>% .[i,] %>% tag_taxon)
}
genus2 =  Escherichia/Shigella NA 
genus =  Escherichia/Shigella
[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"
genus2 =  Bacteroides Bacteroides 
genus =  Bacteroides 
[1] "Bacteroides dorei"
genus2 =  Prevotella Prevotella 
genus =  Prevotella 
[1] "Prevotella copri"
genus2 =  Bacteroides Bacteroides 
genus =  Bacteroides 
[1] "Bacteroides dorei/vulgatus"
genus2 =  Prevotella Prevotella 
genus =  Prevotella 
[1] "Prevotella disiens"

OK, the for loop is doing what we expect. Now for dplyr::rowwise:

taxTest %>% rowwise %>% tag_taxon
genus2 =  Escherichia/Shigella Bacteroides Prevotella Bacteroides Prevotella NA Bacteroides Prevotella Bacteroides Prevotella 
genus =  Escherichia/Shigella 
[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"
[2] "Escherichia/Shigella dorei"                                                                              
[3] "Escherichia/Shigella copri"                                                                              
[4] "Escherichia/Shigella dorei/vulgatus"                                                                     
[5] "Escherichia/Shigella disiens"

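So with rowwise, genus2 comes back as a single vector holding every value of Genus followed by every value of Genus.y (the NA is still there when genus2 is printed; na.omit drops it afterwards). genus then keeps only the first value, and paste reuses it for every species. This may be related to how dplyr does non-standard evaluation, but I'm not positive.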
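A likely more direct explanation is that rowwise() only marks the data frame as row-grouped. Verbs such as mutate() and summarise() act on that mark, but select() ignores it and returns every row, so a function that starts with select() still sees the whole table. A minimal sketch with a made-up one-column data frame (df and the column g are illustrative, not from the question):

```r
library(dplyr)

df <- data.frame(g = c("a", "b", "c"), stringsAsFactors = FALSE)

# rowwise() only tags the data frame; select() still returns every row.
n_selected <- df %>% rowwise() %>% select(g) %>% nrow()
print(n_selected)

# mutate() does honour the tag: inside it, g is a single value per row,
# so length(g) is 1 for each row instead of 3.
per_row <- df %>% rowwise() %>% mutate(n = length(g))
print(per_row$n)
```

This would also explain the edit in the question: with the genus taken from Genus alone, paste receives two full-length vectors and pairs them up element by element, so the result looks right even though nothing was processed row by row.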

If you want to use your own function, it works as expected with by_row from the purrrlyr package:

library(purrrlyr)

taxTest %>% by_row(tag_taxon)
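As for the apply() error in the question: apply() first converts the data frame to a matrix (a character matrix here) and hands each row to the function as a plain character vector, and select() has no method for that. If an extra package is not wanted, one option is to index the rows yourself so the function always receives a one-row data frame. A minimal base R sketch, where df and tag_row are hypothetical stand-ins for taxTest and tag_taxon:

```r
df <- data.frame(
  Genus   = c("Bacteroides", "Prevotella"),
  Species = c("dorei", "copri"),
  stringsAsFactors = FALSE
)

# apply() passes each row as a character vector, not a data frame.
row_classes <- apply(df, 1, class)
print(row_classes)

# Hypothetical stand-in for tag_taxon: expects a one-row data frame.
tag_row <- function(row_df) paste(row_df$Genus, row_df$Species)

# Index row by row; drop = FALSE keeps each piece a data frame.
tags <- vapply(
  seq_len(nrow(df)),
  function(i) tag_row(df[i, , drop = FALSE]),
  character(1)
)
print(tags)
```

This is the question's for loop in functional form; vapply also checks that each call returns exactly one string.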