My input data frame is inconsistent. Here it is:
df <- structure(list(Gene = c("k141_1305_1", "k141_1406_2", "k141_1406_3",
"k141_6669_1", "k141_9215_1", "k141_13242_1", "k141_13333_5",
"k141_17708_1", "k141_19670_1", "k141_19670_6"), Phylum = c("p__Actinobacteria",
"p__Firmicutes", "p__Firmicutes", "p__Cyanobacteria", "p__Actinobacteria",
"p__Actinobacteria", "p__Firmicutes", "p__Firmicutes", "p__Actinobacteria",
"p__Proteobacteria"), Class = c("c__Actinobacteria", "c__Clostridia",
"c__Clostridia", "o__Nostocales", "c__Actinobacteria", "c__Actinobacteria",
"c__Clostridia", "c__Bacilli", "c__Actinobacteria", "c__Gammaproteobacteria"
), Order = c("o__Pseudonocardiales", "o__Clostridiales", "o__Clostridiales",
"f__Hapalosiphonaceae", "o__Pseudonocardiales", "o__Pseudonocardiales",
"o__Clostridiales", "o__Bacillales", "o__Pseudonocardiales",
"o__Pseudomonadales"), Family = c("f__Pseudonocardiaceae", "f__Lachnospiraceae",
"f__Lachnospiraceae", "g__Fischerella", "f__Pseudonocardiaceae",
"f__Pseudonocardiaceae", "f__Clostridiales Family XIII. Incertae Sedis",
"g__Exiguobacterium", "f__Pseudonocardiaceae", "f__Pseudomonadaceae"
), Genus = c("g__Pseudonocardia", "s__Lachnospiraceae bacterium 10-1",
"s__Lachnospiraceae bacterium 10-1", "s__Fischerella muscicola",
"g__Pseudonocardia", "g__Pseudonocardia", "s__[Eubacterium] infirmum",
"s__Exiguobacterium enclense", "g__Pseudonocardia", "g__Pseudomonas"
), Species = c("s__Pseudonocardia sp. Ae331_Ps2", "unknown",
"unknown", "unknown", "s__Pseudonocardia sp. Ae331_Ps2", "s__Pseudonocardia sp. Ae331_Ps2",
"unknown", "unknown", "s__Pseudonocardia ammonioxydans", "s__Pseudomonas aeruginosa group"
)), .Names = c("Gene", "Phylum", "Class", "Order", "Family",
"Genus", "Species"), row.names = c(3212L, 3853L, 3854L, 17967L,
24006L, 34126L, 34325L, 43722L, 49328L, 49332L), class = "data.frame")
The data frame looks like this:
Gene Phylum Class Order Family
3212 k141_1305_1 p__Actinobacteria c__Actinobacteria o__Pseudonocardiales f__Pseudonocardiaceae
3853 k141_1406_2 p__Firmicutes c__Clostridia o__Clostridiales f__Lachnospiraceae
3854 k141_1406_3 p__Firmicutes c__Clostridia o__Clostridiales f__Lachnospiraceae
17967 k141_6669_1 p__Cyanobacteria o__Nostocales f__Hapalosiphonaceae g__Fischerella
24006 k141_9215_1 p__Actinobacteria c__Actinobacteria o__Pseudonocardiales f__Pseudonocardiaceae
34126 k141_13242_1 p__Actinobacteria c__Actinobacteria o__Pseudonocardiales f__Pseudonocardiaceae
Genus Species
3212 g__Pseudonocardia s__Pseudonocardia sp. Ae331_Ps2
3853 s__Lachnospiraceae bacterium 10-1 unknown
3854 s__Lachnospiraceae bacterium 10-1 unknown
17967 s__Fischerella muscicola unknown
24006 g__Pseudonocardia s__Pseudonocardia sp. Ae331_Ps2
34126 g__Pseudonocardia s__Pseudonocardia sp. Ae331_Ps2
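As you can see, the data frame is not structured the way it should be. It is generated this way upstream, so I have no control over it.
The problem is that each microbe should be annotated at every taxonomic rank (from phylum to species, one rank per column). In some cases a rank is missing: for example, the gene in row 17967 (the 4th row) has no class rank (no "c__" annotation). What happens is that the Class column holds this taxon's order ("o__Nostocales") instead of an empty "c__" annotation, and everything after it is shifted one column to the left. The same goes for other rows: row 2, for example, has no genus ("g__") annotation, so the species ends up in the Genus column.
The first row and the last two rows are examples of how it should look.
Is there a quick way to correct these rows so that each column holds the matching taxonomic rank?
For example, taking the second row, the correct output should be: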
Gene Phylum Class Order Family
3853 k141_1406_2 p__Firmicutes c__Clostridia o__Clostridiales f__Lachnospiraceae
Genus Species
3853 g__ s__Lachnospiraceae bacterium 10-1
Alternatively, the missing rank could carry a g__unknown label:
3853 k141_1406_2 p__Firmicutes c__Clostridia o__Clostridiales f__Lachnospiraceae
3853 g__unknown s__Lachnospiraceae bacterium 10-1
Answer 0 (score: 3)
Try this code:
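adds <- function(x) {
  # expected one-letter prefix per column ("k" matches the k141_... gene IDs)
  nam <- c("k", "p", "c", "o", "f", "g", "s")
  # positions whose prefix does not occur anywhere in this row
  l <- which(is.na(match(nam, substr(x, 1, 1))))
  if (length(l) > 0) {
    # insert an empty "<prefix>__" placeholder and drop the trailing "unknown";
    # append() expects a single `after`, so this handles one missing rank per row
    `names<-`(head(unlist(append(x, paste0(nam[l], "__"), l - 1)), -1), names(x))
  } else x
}
data.frame(t(apply(df, 1, adds)))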
This should insert the missing rank placeholder into each affected row, which gives the expected result. Let me know if this helps. Thanks.
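If a row could be missing more than one rank, a variant that places each value by its prefix, rather than shifting by position, may be more robust. The following is only a sketch under stated assumptions: `toy` is a three-row subset of the question's data, and `realign` and its `fill` argument are hypothetical names, not part of the answer above.

```r
# Toy subset of the question's data: one correct row, two misaligned rows.
toy <- data.frame(
  Gene    = c("k141_1305_1", "k141_1406_2", "k141_6669_1"),
  Phylum  = c("p__Actinobacteria", "p__Firmicutes", "p__Cyanobacteria"),
  Class   = c("c__Actinobacteria", "c__Clostridia", "o__Nostocales"),
  Order   = c("o__Pseudonocardiales", "o__Clostridiales", "f__Hapalosiphonaceae"),
  Family  = c("f__Pseudonocardiaceae", "f__Lachnospiraceae", "g__Fischerella"),
  Genus   = c("g__Pseudonocardia", "s__Lachnospiraceae bacterium 10-1",
              "s__Fischerella muscicola"),
  Species = c("s__Pseudonocardia sp. Ae331_Ps2", "unknown", "unknown"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: put each value into the column whose prefix it carries.
realign <- function(x, fill = "unknown") {
  ranks <- c(Phylum = "p", Class = "c", Order = "o",
             Family = "f", Genus = "g", Species = "s")
  vals <- x[names(ranks)]
  # prefix = text before the first "__"; values without "__" (e.g. "unknown") get NA
  prefix <- ifelse(grepl("__", vals, fixed = TRUE), sub("__.*$", "", vals), NA)
  # for each expected rank, pick the first value carrying that prefix (NA if absent)
  out <- vals[match(ranks, prefix)]
  missing <- is.na(out)
  out[missing] <- paste0(ranks[missing], "__", fill)
  c(x["Gene"], setNames(out, names(ranks)))
}

fixed <- data.frame(t(apply(toy, 1, realign)), stringsAsFactors = FALSE)
fixed
```

Calling `realign` with `fill = ""` would produce the bare `g__` form from the question instead of `g__unknown`. If a row carried the same prefix twice, only the first occurrence would be kept.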
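Answer 1 (score: 1)
First, we need to recognize that the current columns carry no meaning (apart from their names), and that the prefixes are what carry the meaning; we will map those prefixes to the long column names.
So we build a lookup table, then use tidyr and dplyr: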
library(dplyr)
library(tidyr)
lkp <- data.frame(key1 = c("Phylum","Class","Order","Family","Genus","Species"),
key2 = c("p","c","o","f","g","s"),
stringsAsFactors = FALSE)
df %>% gather(key,val,-Gene) %>% # put everything in a single column
filter(val != "unknown") %>% # get rid of the unknowns, they don't contain info and have irregular format (no underscore)
separate(val,c("key2","val2"),sep="__",remove = FALSE) %>% # separate the values, keeping the original
left_join(lkp, by = "key2") %>% # add info from lookup table
select(Gene,val,key1) %>% # keep only relevant columns
spread (key1,val, fill = "unknown") %>% # set back in wide format
as.data.frame # convert from tibble to data.frame
# Gene Class Family Genus Order Phylum Species
# 1 k141_1305_1 c__Actinobacteria f__Pseudonocardiaceae g__Pseudonocardia o__Pseudonocardiales p__Actinobacteria s__Pseudonocardia sp. Ae331_Ps2
# 2 k141_13242_1 c__Actinobacteria f__Pseudonocardiaceae g__Pseudonocardia o__Pseudonocardiales p__Actinobacteria s__Pseudonocardia sp. Ae331_Ps2
# 3 k141_13333_5 c__Clostridia f__Clostridiales Family XIII. Incertae Sedis unknown o__Clostridiales p__Firmicutes s__[Eubacterium] infirmum
# 4 k141_1406_2 c__Clostridia f__Lachnospiraceae unknown o__Clostridiales p__Firmicutes s__Lachnospiraceae bacterium 10-1
# 5 k141_1406_3 c__Clostridia f__Lachnospiraceae unknown o__Clostridiales p__Firmicutes s__Lachnospiraceae bacterium 10-1
# 6 k141_17708_1 c__Bacilli unknown g__Exiguobacterium o__Bacillales p__Firmicutes s__Exiguobacterium enclense
# 7 k141_19670_1 c__Actinobacteria f__Pseudonocardiaceae g__Pseudonocardia o__Pseudonocardiales p__Actinobacteria s__Pseudonocardia ammonioxydans
# 8 k141_19670_6 c__Gammaproteobacteria f__Pseudomonadaceae g__Pseudomonas o__Pseudomonadales p__Proteobacteria s__Pseudomonas aeruginosa group
# 9 k141_6669_1 unknown f__Hapalosiphonaceae g__Fischerella o__Nostocales p__Cyanobacteria s__Fischerella muscicola
# 10 k141_9215_1 c__Actinobacteria f__Pseudonocardiaceae g__Pseudonocardia o__Pseudonocardiales p__Actinobacteria s__Pseudonocardia sp. Ae331_Ps2
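Note that the columns in this output come back in alphabetical order rather than in taxonomic order. A minimal base R sketch for restoring the order, assuming a result shaped like the one above (`res` is a hypothetical one-row stand-in for it):

```r
# Hypothetical stand-in for one row of the reshaped result (alphabetical columns).
res <- data.frame(
  Gene    = "k141_1406_2",
  Class   = "c__Clostridia",
  Family  = "f__Lachnospiraceae",
  Genus   = "unknown",
  Order   = "o__Clostridiales",
  Phylum  = "p__Firmicutes",
  Species = "s__Lachnospiraceae bacterium 10-1",
  stringsAsFactors = FALSE
)

# Reorder the columns from phylum down to species.
rank_order <- c("Gene", "Phylum", "Class", "Order", "Family", "Genus", "Species")
res <- res[, rank_order]
res
```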
If you also want to strip the prefixes along the way, replace val with val2 in its last two occurrences:
df %>% gather(key,val,-Gene) %>%
filter(val != "unknown") %>%
separate(val,c("key2","val2"),sep="__",remove = FALSE) %>%
left_join(lkp, by = "key2") %>%
select(Gene,val2,key1) %>%
spread (key1,val2,fill="unknown") %>%
as.data.frame
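In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A rough equivalent of the pipeline above might look like the sketch below. It assumes tidyr 1.1.0 or later (for a scalar `values_fill`), that every value other than "unknown" contains "__", and that each rank occurs in at least one row; `toy`, `ranks` and `old_col` are illustrative names, not from the answer.

```r
library(dplyr)
library(tidyr)

# Toy subset of the question's data: one correct row, one misaligned row.
toy <- data.frame(
  Gene    = c("k141_1305_1", "k141_1406_2"),
  Phylum  = c("p__Actinobacteria", "p__Firmicutes"),
  Class   = c("c__Actinobacteria", "c__Clostridia"),
  Order   = c("o__Pseudonocardiales", "o__Clostridiales"),
  Family  = c("f__Pseudonocardiaceae", "f__Lachnospiraceae"),
  Genus   = c("g__Pseudonocardia", "s__Lachnospiraceae bacterium 10-1"),
  Species = c("s__Pseudonocardia sp. Ae331_Ps2", "unknown"),
  stringsAsFactors = FALSE
)

# Prefix -> column name; a named vector plays the role of the lookup table.
ranks <- c(p = "Phylum", c = "Class", o = "Order",
           f = "Family", g = "Genus", s = "Species")

fixed <- toy %>%
  pivot_longer(cols = -Gene, names_to = "old_col", values_to = "val") %>%
  filter(val != "unknown") %>%                          # drop uninformative cells
  mutate(rank = unname(ranks[sub("__.*$", "", val)])) %>% # rank from the prefix
  select(Gene, rank, val) %>%
  pivot_wider(names_from = rank, values_from = val,
              values_fill = "unknown") %>%              # absent ranks become "unknown"
  select(Gene, all_of(unname(ranks))) %>%               # taxonomic column order
  as.data.frame()
fixed
```

The final select() keeps the columns in taxonomic order, so no separate reordering step is needed here.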