Rearranging a mixed-up input data frame

Date: 2017-09-04 08:07:48

Tags: r

My input data frame is inconsistent. Here it is:

df <- structure(list(Gene = c("k141_1305_1", "k141_1406_2", "k141_1406_3", 
"k141_6669_1", "k141_9215_1", "k141_13242_1", "k141_13333_5", 
"k141_17708_1", "k141_19670_1", "k141_19670_6"), Phylum = c("p__Actinobacteria", 
"p__Firmicutes", "p__Firmicutes", "p__Cyanobacteria", "p__Actinobacteria", 
"p__Actinobacteria", "p__Firmicutes", "p__Firmicutes", "p__Actinobacteria", 
"p__Proteobacteria"), Class = c("c__Actinobacteria", "c__Clostridia", 
"c__Clostridia", "o__Nostocales", "c__Actinobacteria", "c__Actinobacteria", 
"c__Clostridia", "c__Bacilli", "c__Actinobacteria", "c__Gammaproteobacteria"
), Order = c("o__Pseudonocardiales", "o__Clostridiales", "o__Clostridiales", 
"f__Hapalosiphonaceae", "o__Pseudonocardiales", "o__Pseudonocardiales", 
"o__Clostridiales", "o__Bacillales", "o__Pseudonocardiales", 
"o__Pseudomonadales"), Family = c("f__Pseudonocardiaceae", "f__Lachnospiraceae", 
"f__Lachnospiraceae", "g__Fischerella", "f__Pseudonocardiaceae", 
"f__Pseudonocardiaceae", "f__Clostridiales Family XIII. Incertae Sedis", 
"g__Exiguobacterium", "f__Pseudonocardiaceae", "f__Pseudomonadaceae"
), Genus = c("g__Pseudonocardia", "s__Lachnospiraceae bacterium 10-1", 
"s__Lachnospiraceae bacterium 10-1", "s__Fischerella muscicola", 
"g__Pseudonocardia", "g__Pseudonocardia", "s__[Eubacterium] infirmum", 
"s__Exiguobacterium enclense", "g__Pseudonocardia", "g__Pseudomonas"
), Species = c("s__Pseudonocardia sp. Ae331_Ps2", "unknown", 
"unknown", "unknown", "s__Pseudonocardia sp. Ae331_Ps2", "s__Pseudonocardia sp. Ae331_Ps2", 
"unknown", "unknown", "s__Pseudonocardia ammonioxydans", "s__Pseudomonas aeruginosa group"
)), .Names = c("Gene", "Phylum", "Class", "Order", "Family", 
"Genus", "Species"), row.names = c(3212L, 3853L, 3854L, 17967L, 
24006L, 34126L, 34325L, 43722L, 49328L, 49332L), class = "data.frame")

The data frame looks like this:

              Gene            Phylum             Class                Order                Family
3212   k141_1305_1 p__Actinobacteria c__Actinobacteria o__Pseudonocardiales f__Pseudonocardiaceae
3853   k141_1406_2     p__Firmicutes     c__Clostridia     o__Clostridiales    f__Lachnospiraceae
3854   k141_1406_3     p__Firmicutes     c__Clostridia     o__Clostridiales    f__Lachnospiraceae
17967  k141_6669_1  p__Cyanobacteria     o__Nostocales f__Hapalosiphonaceae        g__Fischerella
24006  k141_9215_1 p__Actinobacteria c__Actinobacteria o__Pseudonocardiales f__Pseudonocardiaceae
34126 k141_13242_1 p__Actinobacteria c__Actinobacteria o__Pseudonocardiales f__Pseudonocardiaceae
                                  Genus                         Species
3212                  g__Pseudonocardia s__Pseudonocardia sp. Ae331_Ps2
3853  s__Lachnospiraceae bacterium 10-1                         unknown
3854  s__Lachnospiraceae bacterium 10-1                         unknown
17967          s__Fischerella muscicola                         unknown
24006                 g__Pseudonocardia s__Pseudonocardia sp. Ae331_Ps2
34126                 g__Pseudonocardia s__Pseudonocardia sp. Ae331_Ps2

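As you can see, the data frame is not structured the way it should be. It is generated this way, so I have no control over it.

The problem is that the microbes should be annotated with one taxonomic rank per column (from phylum to species). In some cases a rank is missing: for example, the gene in row 17967 (the 4th row) has no class rank (no "c__" annotation). When that happens, the Class column holds this taxon's order ("o__Nostocales") instead of an empty "c__" annotation, and every later rank is shifted one column to the left. The same goes for other rows: for example, the 2nd row has no genus ("g__") annotation, so the species has been put in the Genus column.

The first row and the last two rows are examples of how it should look.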
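The mismatch described above can be detected by comparing the prefix of each value with the prefix its column is supposed to hold. The following is only a minimal sketch: `toy` is a three-row subset of the data above, and the names `expected`, `found`, `mismatch` and `shifted_rows` are made up for illustration.

```r
# Toy subset of the data: row 1 is correct, rows 2 and 3 are shifted
toy <- data.frame(
  Gene    = c("k141_1305_1", "k141_1406_2", "k141_6669_1"),
  Phylum  = c("p__Actinobacteria", "p__Firmicutes", "p__Cyanobacteria"),
  Class   = c("c__Actinobacteria", "c__Clostridia", "o__Nostocales"),
  Order   = c("o__Pseudonocardiales", "o__Clostridiales", "f__Hapalosiphonaceae"),
  Family  = c("f__Pseudonocardiaceae", "f__Lachnospiraceae", "g__Fischerella"),
  Genus   = c("g__Pseudonocardia", "s__Lachnospiraceae bacterium 10-1",
              "s__Fischerella muscicola"),
  Species = c("s__Pseudonocardia sp. Ae331_Ps2", "unknown", "unknown"),
  stringsAsFactors = FALSE
)

# Prefix that each rank column is expected to carry
expected <- c(Phylum = "p", Class = "c", Order = "o",
              Family = "f", Genus = "g", Species = "s")

# Prefix actually found in each cell (first character of the value)
found <- sapply(toy[names(expected)], substr, 1, 1)

# TRUE where a cell's prefix disagrees with its column
mismatch <- found != matrix(expected, nrow(found), ncol(found), byrow = TRUE)

# Rows with at least one misplaced value
shifted_rows <- which(rowSums(mismatch) > 0)
toy$Gene[shifted_rows]
```

Note that "unknown" in the Species column also counts as a mismatch here, since it does not start with "s".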

Is there a quick way to correct these rows so that each column holds the corresponding taxonomic rank?

For example, taking the second row, the correct output would be:

            Gene        Phylum         Class            Order             Family
3853 k141_1406_2 p__Firmicutes c__Clostridia o__Clostridiales f__Lachnospiraceae
     Genus                           Species
3853   g__ s__Lachnospiraceae bacterium 10-1

Alternatively, the missing rank could be labelled as unknown, e.g. g__unknown:

3853 k141_1406_2 p__Firmicutes c__Clostridia o__Clostridiales f__Lachnospiraceae
3853 g__unknown s__Lachnospiraceae bacterium 10-1

2 Answers:

Answer 0 (score: 3)

Try this code:

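 adds <- function(x) {
   # Expected first letter of each column; "k" matches the Gene IDs (k141_...)
   nam <- c("k", "p", "c", "o", "f", "g", "s")
   # Position of the rank prefix that does not occur in this row
   l <- which(is.na(match(nam, substr(x, 1, 1))))
   if (length(l) > 0) {
     # Insert an empty placeholder (e.g. "g__") and drop the trailing "unknown"
     `names<-`(head(unlist(append(x, paste0(nam[l], "__"), l - 1)), -1), names(x))
   } else {
     x
   }
 }

 data.frame(t(apply(df, 1, adds)))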

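This inserts the missing rank placeholder (for example "g__") into each affected row, which gives the expected result. Note that it assumes each row is missing at most one rank and that every Gene ID starts with "k". Let me know if this helps. Thanks.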
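If a row could be missing more than one rank, a variation is to rebuild each row by looking up which value carries which prefix, rather than inserting a placeholder at a position. This is a sketch, not part of the answer above: `realign_row` and `ranks` are hypothetical names, `toy` is a small subset of the question's data, and missing ranks are filled with the "g__unknown" style label mentioned in the question.

```r
# Toy subset of the question's data: row 1 is correct, rows 2 and 3 are shifted
toy <- data.frame(
  Gene    = c("k141_1305_1", "k141_1406_2", "k141_6669_1"),
  Phylum  = c("p__Actinobacteria", "p__Firmicutes", "p__Cyanobacteria"),
  Class   = c("c__Actinobacteria", "c__Clostridia", "o__Nostocales"),
  Order   = c("o__Pseudonocardiales", "o__Clostridiales", "f__Hapalosiphonaceae"),
  Family  = c("f__Pseudonocardiaceae", "f__Lachnospiraceae", "g__Fischerella"),
  Genus   = c("g__Pseudonocardia", "s__Lachnospiraceae bacterium 10-1",
              "s__Fischerella muscicola"),
  Species = c("s__Pseudonocardia sp. Ae331_Ps2", "unknown", "unknown"),
  stringsAsFactors = FALSE
)

# Target column for each prefix
ranks <- c(Phylum = "p", Class = "c", Order = "o",
           Family = "f", Genus = "g", Species = "s")

# Hypothetical helper: realign one row (a named character vector)
realign_row <- function(x) {
  vals <- x[names(ranks)]
  # "p__Firmicutes" -> "p"; a bare "unknown" stays "unknown" and matches nothing
  prefix <- sub("__.*$", "", vals)
  # For each rank, pick the value carrying its prefix (NA when absent)
  out <- vals[match(ranks, prefix)]
  missing <- is.na(out)
  out[missing] <- paste0(ranks[missing], "__unknown")
  c(x["Gene"], setNames(out, names(ranks)))
}

fixed <- as.data.frame(t(apply(toy, 1, realign_row)), stringsAsFactors = FALSE)
fixed
```

Because the lookup is done by prefix, the Gene column no longer needs to start with a particular letter.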

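Answer 1 (score: 1)

First, we need to recognize that the current columns carry no meaning (apart from their names), and that it is the prefixes that carry the meaning, which we will map to the long column names.

So we build a lookup table, then use tidyr and dplyr: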

library(dplyr)
library(tidyr)
lkp <- data.frame(key1  = c("Phylum","Class","Order","Family","Genus","Species"),
                  key2 = c("p","c","o","f","g","s"),
                  stringsAsFactors = FALSE)

df %>% gather(key,val,-Gene)          %>%  # put everything in a single column
  filter(val != "unknown")            %>%  # get rid of the unknowns, they don't contain info and have irregular format (no "__" prefix)
  separate(val,c("key2","val2"),sep="__",remove = FALSE) %>% # separate the values, keeping the original
  left_join(lkp, by = "key2")         %>%  # add info from lookup table
  select(Gene,val,key1)               %>%  # keep only relevant columns
  spread (key1,val, fill = "unknown") %>%  # set back in wide format
  as.data.frame                            # convert from tibble to data.frame

# Gene                  Class                                       Family              Genus                Order            Phylum                           Species
# 1   k141_1305_1      c__Actinobacteria                        f__Pseudonocardiaceae  g__Pseudonocardia o__Pseudonocardiales p__Actinobacteria   s__Pseudonocardia sp. Ae331_Ps2
# 2  k141_13242_1      c__Actinobacteria                        f__Pseudonocardiaceae  g__Pseudonocardia o__Pseudonocardiales p__Actinobacteria   s__Pseudonocardia sp. Ae331_Ps2
# 3  k141_13333_5          c__Clostridia f__Clostridiales Family XIII. Incertae Sedis            unknown     o__Clostridiales     p__Firmicutes         s__[Eubacterium] infirmum
# 4   k141_1406_2          c__Clostridia                           f__Lachnospiraceae            unknown     o__Clostridiales     p__Firmicutes s__Lachnospiraceae bacterium 10-1
# 5   k141_1406_3          c__Clostridia                           f__Lachnospiraceae            unknown     o__Clostridiales     p__Firmicutes s__Lachnospiraceae bacterium 10-1
# 6  k141_17708_1             c__Bacilli                                      unknown g__Exiguobacterium        o__Bacillales     p__Firmicutes       s__Exiguobacterium enclense
# 7  k141_19670_1      c__Actinobacteria                        f__Pseudonocardiaceae  g__Pseudonocardia o__Pseudonocardiales p__Actinobacteria   s__Pseudonocardia ammonioxydans
# 8  k141_19670_6 c__Gammaproteobacteria                          f__Pseudomonadaceae     g__Pseudomonas   o__Pseudomonadales p__Proteobacteria   s__Pseudomonas aeruginosa group
# 9   k141_6669_1                unknown                         f__Hapalosiphonaceae     g__Fischerella        o__Nostocales  p__Cyanobacteria          s__Fischerella muscicola
# 10  k141_9215_1      c__Actinobacteria                        f__Pseudonocardiaceae  g__Pseudonocardia o__Pseudonocardiales p__Actinobacteria   s__Pseudonocardia sp. Ae331_Ps2

If you also want to strip the prefixes, replace val with val2 in the select() and spread() calls:

df %>% gather(key,val,-Gene) %>%
  filter(val != "unknown") %>%
  separate(val,c("key2","val2"),sep="__",remove = FALSE) %>%
  left_join(lkp, by = "key2") %>%
  select(Gene,val2,key1) %>%
  spread (key1,val2,fill="unknown") %>%
  as.data.frame
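The same reshaping can also be sketched with pivot_longer() and pivot_wider(), which tidyr offers as newer alternatives to gather() and spread(). This is an illustration on a toy subset rather than the code from the answer: `toy`, `ranks` and `fixed` are made-up names, and passing a single string to `values_fill` assumes a reasonably recent tidyr (older releases expected a named list). A final select() puts the columns back in taxonomic order instead of alphabetical order.

```r
library(dplyr)
library(tidyr)

# Toy subset of the question's data: row 1 is correct, rows 2 and 3 are shifted
toy <- data.frame(
  Gene    = c("k141_1305_1", "k141_1406_2", "k141_6669_1"),
  Phylum  = c("p__Actinobacteria", "p__Firmicutes", "p__Cyanobacteria"),
  Class   = c("c__Actinobacteria", "c__Clostridia", "o__Nostocales"),
  Order   = c("o__Pseudonocardiales", "o__Clostridiales", "f__Hapalosiphonaceae"),
  Family  = c("f__Pseudonocardiaceae", "f__Lachnospiraceae", "g__Fischerella"),
  Genus   = c("g__Pseudonocardia", "s__Lachnospiraceae bacterium 10-1",
              "s__Fischerella muscicola"),
  Species = c("s__Pseudonocardia sp. Ae331_Ps2", "unknown", "unknown"),
  stringsAsFactors = FALSE
)

# Prefix -> column name, listed in taxonomic order
ranks <- c(p = "Phylum", c = "Class", o = "Order",
           f = "Family", g = "Genus", s = "Species")

fixed <- toy %>%
  pivot_longer(-Gene, names_to = "old_col", values_to = "val") %>% # long format
  filter(val != "unknown") %>%                                     # drop uninformative cells
  mutate(rank = unname(ranks[sub("__.*$", "", val)])) %>%          # prefix -> real column
  select(Gene, rank, val) %>%
  pivot_wider(names_from = rank, values_from = val,
              values_fill = "unknown") %>%                         # back to wide format
  select(Gene, all_of(unname(ranks))) %>%                          # taxonomic column order
  as.data.frame()

fixed
```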