Need to spread a data.frame with character and numeric values in R

Time: 2018-06-19 18:40:51

Tags: r dplyr tidyr tidyverse

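I have several data frames in a list, and I am trying to join them using map from purrr. I ended up using full_join because they do not all have the same columns. For example, take the following data frame: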

DF1 <- structure(list(scrubbed_species_binomial = c("Solanum montanum", 
"Solanum montanum", "Solanum montanum"), trait_name = c("whole plant woodiness", 
"whole plant growth form", "whole plant growth form diversity"
), trait_value = c("herbaceous", "Herb", "Herb")), row.names = c(NA, 
3L), class = "data.frame", .Names = c("scrubbed_species_binomial", 
"trait_name", "trait_value"))

When I use the tidyverse to spread the traits into separate columns:

DF1 %>% distinct() %>% spread(trait_name, trait_value)

I get the following:

scrubbed_species_binomial whole plant growth form whole plant growth form diversity whole plant woodiness
1          Solanum montanum                    Herb                              Herb            herbaceous

Problem

Some of the data.frames have repeated numeric traits, but because trait_value is a character column, I get an error when I try to use spread and summarize_if(is.numeric).

DF2 <- structure(list(scrubbed_species_binomial = c("Solanum peruvianum", 
"Solanum peruvianum", "Solanum peruvianum", "Solanum peruvianum", 
"Solanum peruvianum", "Solanum peruvianum", "Solanum peruvianum", 
"Solanum peruvianum", "Solanum peruvianum", "Solanum peruvianum", 
"Solanum peruvianum"), trait_name = c("whole plant growth form diversity", 
"leaf area per leaf dry mass", "leaf dry mass per leaf fresh mass", 
"leaf dry mass per leaf fresh mass", "leaf thickness", "leaf dry mass per leaf fresh mass", 
"leaf thickness", "leaf thickness", "leaf area per leaf dry mass", 
"leaf area per leaf dry mass", "whole plant growth form"), trait_value = 
c("Herb", 
"1.84229918938836e-05", "1.913", "2.166", "0.2506", "1.898", 
"0.2358", "0.2535", "2.21729490022173e-05", "2.07770621234157e-05", 
"Herb")), row.names = c(NA, 11L), class = "data.frame", .Names = 
c("scrubbed_species_binomial", 
"trait_name", "trait_value"))

First attempt

When I try this:

DF2 %>% distinct() %>% spread(trait_name, trait_value)

I get the following error:

Error: Duplicate identifiers for rows (2, 9, 10), (3, 4, 6), (5, 7, 8)
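The error means that, even after distinct(), some species/trait pairs still have more than one value, so spread cannot decide which value goes in the single cell. A minimal sketch for listing those pairs follows; the `traits` data frame is a small stand-in built from a few of the DF2 rows, and the names `traits` and `dupes` are only illustrative.

```r
library(dplyr)

# Small stand-in for DF2: same three columns, a subset of its rows
traits <- data.frame(
  scrubbed_species_binomial = "Solanum peruvianum",
  trait_name  = c("leaf thickness", "leaf thickness", "whole plant growth form"),
  trait_value = c("0.2506", "0.2358", "Herb"),
  stringsAsFactors = FALSE
)

# Count rows per species/trait pair; any pair with n > 1 makes spread() fail
dupes <- traits %>%
  distinct() %>%
  count(scrubbed_species_binomial, trait_name) %>%
  filter(n > 1)

dupes
```

Each row of `dupes` is a trait that needs either a summary (for example a mean) or an extra index column before it can be spread.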

Summarizing the numeric values does not work either.

Please let me know what I should do.
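One possible way to get a summary to work, sketched here under the assumption that averaging repeated measurements is acceptable: convert trait_value to numeric first, average the numeric traits per species, spread the numeric and character traits separately, and join the two results. The column `value_num` and the names `numeric_wide`, `character_wide` and `wide` are hypothetical helpers, and `traits` is a small stand-in for DF2.

```r
library(dplyr)
library(tidyr)

# Small stand-in for DF2: same three columns, a subset of its rows
traits <- data.frame(
  scrubbed_species_binomial = "Solanum peruvianum",
  trait_name  = c("leaf thickness", "leaf thickness", "leaf thickness",
                  "whole plant growth form"),
  trait_value = c("0.2506", "0.2358", "0.2535", "Herb"),
  stringsAsFactors = FALSE
)

# as.numeric() returns NA for text such as "Herb"; silence the coercion warning
traits <- traits %>%
  mutate(value_num = suppressWarnings(as.numeric(trait_value)))

# Numeric traits: average the repeated measurements, then spread
numeric_wide <- traits %>%
  filter(!is.na(value_num)) %>%
  group_by(scrubbed_species_binomial, trait_name) %>%
  summarise(value_num = mean(value_num)) %>%
  ungroup() %>%
  spread(trait_name, value_num)

# Character traits: drop exact duplicates, then spread
character_wide <- traits %>%
  filter(is.na(value_num)) %>%
  select(-value_num) %>%
  distinct() %>%
  spread(trait_name, trait_value)

# One row per species, numeric columns as doubles, text columns as characters
wide <- full_join(numeric_wide, character_wide,
                  by = "scrubbed_species_binomial")
wide
```

Note that a trait_value that is genuinely missing (NA) would be routed to the character branch in this sketch, so real data may need an explicit list of which traits are numeric.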

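2 answers:

Answer 0: (score: 0)

I don't think this is a great solution, because it creates the appearance of an association between values where there is none, but it may give you what you need. It is the same idea @camille suggested, just fleshed out.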

DF2 %>% 
  group_by(trait_name) %>% 
  mutate(id = 1:n()) %>% 
  spread(trait_name, trait_value, convert = TRUE) %>%
  fill(everything())
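To make the two moving parts of this answer easier to see, here is a small sketch that stops before fill(). It assumes a stand-in data frame `traits` with a subset of the DF2 rows, and adds an ungroup() call for safety. With convert = TRUE, spread runs type.convert() on the new columns, so the measurement column comes back numeric while the text column stays character.

```r
library(dplyr)
library(tidyr)

# Small stand-in for DF2: same three columns, a subset of its rows
traits <- data.frame(
  scrubbed_species_binomial = "Solanum peruvianum",
  trait_name  = c("leaf thickness", "leaf thickness", "leaf thickness",
                  "whole plant growth form"),
  trait_value = c("0.2506", "0.2358", "0.2535", "Herb"),
  stringsAsFactors = FALSE
)

wide <- traits %>%
  group_by(trait_name) %>%
  mutate(id = row_number()) %>%   # 1, 2, 3 within each trait
  ungroup() %>%
  spread(trait_name, trait_value, convert = TRUE)

# Column types after conversion
sapply(wide, class)

# Before fill(), traits with fewer values are NA in the extra rows
wide
```

The NA cells in the last two rows are what fill(everything()) overwrites with "Herb", which is the artificial pairing of values the answer warns about.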

Answer 1: (score: 0)

You need to tell spread that repeated trait_name values should go in different rows, so add a row_number() index before spread and you should be all set.

library(dplyr)
library(tidyr)

df2 %>%
  group_by(scrubbed_species_binomial, trait_name) %>%
  mutate(row_idx = row_number()) %>%
  spread(trait_name, trait_value)

This gives:

  scrubbed_species~ row_idx `leaf area per l~ `leaf dry mass ~ `leaf thickness` `whole plant gr~ `whole plant gr~
  <chr>               <int> <chr>             <chr>            <chr>            <chr>            <chr>           
1 Solanum peruvian~       1 1.84229918938836~ 1.913            0.2506           Herb             Herb            
2 Solanum peruvian~       2 2.21729490022173~ 2.166            0.2358           <NA>             <NA>            
3 Solanum peruvian~       3 2.07770621234157~ 1.898            0.2535           <NA>             <NA> 
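In the output above every trait column is still <chr>. As a sketch of the same idea with the newer pivoting interface, assuming tidyr 1.0.0 or later (where pivot_wider() is available), the character columns can be converted afterwards with base R's type.convert(). The data frame `traits` is a small stand-in for df2, and `chr_cols` is a hypothetical helper name.

```r
library(dplyr)
library(tidyr)

# Small stand-in for df2: same three columns, a subset of its rows
traits <- data.frame(
  scrubbed_species_binomial = "Solanum peruvianum",
  trait_name  = c("leaf thickness", "leaf thickness", "leaf thickness",
                  "whole plant growth form"),
  trait_value = c("0.2506", "0.2358", "0.2535", "Herb"),
  stringsAsFactors = FALSE
)

wide <- traits %>%
  group_by(scrubbed_species_binomial, trait_name) %>%
  mutate(row_idx = row_number()) %>%   # index repeated traits within a species
  ungroup() %>%
  pivot_wider(names_from = trait_name, values_from = trait_value)

# Convert only the character columns; as.is = TRUE keeps text as character
chr_cols <- vapply(wide, is.character, logical(1))
wide[chr_cols] <- lapply(wide[chr_cols], type.convert, as.is = TRUE)

wide
```

After the conversion, numeric traits such as leaf thickness can be summarised with functions like mean(), while text traits such as growth form remain character.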


Sample data

df2 <- structure(list(scrubbed_species_binomial = c("Solanum peruvianum", 
"Solanum peruvianum", "Solanum peruvianum", "Solanum peruvianum", 
"Solanum peruvianum", "Solanum peruvianum", "Solanum peruvianum", 
"Solanum peruvianum", "Solanum peruvianum", "Solanum peruvianum", 
"Solanum peruvianum"), trait_name = c("whole plant growth form diversity", 
"leaf area per leaf dry mass", "leaf dry mass per leaf fresh mass", 
"leaf dry mass per leaf fresh mass", "leaf thickness", "leaf dry mass per leaf fresh mass", 
"leaf thickness", "leaf thickness", "leaf area per leaf dry mass", 
"leaf area per leaf dry mass", "whole plant growth form"), trait_value = c("Herb", 
"1.84229918938836e-05", "1.913", "2.166", "0.2506", "1.898", 
"0.2358", "0.2535", "2.21729490022173e-05", "2.07770621234157e-05", 
"Herb")), .Names = c("scrubbed_species_binomial", "trait_name", 
"trait_value"), row.names = c(NA, 11L), class = "data.frame")