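I have spent a lot of time trying to solve this problem without success.
I have a data.frame in which one column contains strings of variable length. The data.frame looks like this: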
Taxa <- as.character(c(
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Proteobacteria(phylum)_Gammaproteobacteria(class)_Enterobacteriales(order)_Enterobacteriaceae(family)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Proteobacteria(phylum)_Gammaproteobacteria(class)_Enterobacteriales(order)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Proteobacteria(phylum)_Gammaproteobacteria(class)_Enterobacteriales(order)_Enterobacteriaceae(family)_Klebsiella(genus)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Proteobacteria(phylum)_Gammaproteobacteria(class)_Enterobacteriales(order)_Enterobacteriaceae(family)_Klebsiella(genus)_Klebsiellapneumoniae(species)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)_Clostridiales(order)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)_Clostridiales(order)_Clostridiaceae(family)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)_Clostridiales(order)_Clostridiaceae(family)_Clostridium(genus)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)_Clostridiales(order)_Clostridiaceae(family)_Clostridium(genus)_Clostridiumbotulinum(species)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)_Clostridiales(order)_Clostridiaceae(family)_Clostridium(genus)_Clostridiumbotulinum(species)_ClostridiumbotulinumCDC66177(strain)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)_Actinobacteria(class)_Actinobacteridae(subclass)_Actinomycetales(order)_Micrococcineae(suborder)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)_Actinobacteria(class)_Actinobacteridae(subclass)_Actinomycetales(order)_Micrococcineae(suborder)_Microbacteriaceae(family)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)_Actinobacteria(class)_Actinobacteridae(subclass)_Actinomycetales(order)_Micrococcineae(suborder)_Microbacteriaceae(family)_Microbacterium(genus)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)_Actinobacteria(class)_Actinobacteridae(subclass)_Actinomycetales(order)_Micrococcineae(suborder)_Microbacteriaceae(family)_Microbacterium(genus)_Microbacteriumlaevaniformans(species)_MicrobacteriumlaevaniformansOR221(strain)"
))
Percent <- c("0.000400","0.006800","0.005034","0.001760","0.000000","0.000000","0.344400","0.000000","0.000000","0.000000","0.006500","0.002819","0.000487","0.000000","0.001090")
Test <- data.frame(Percent, Taxa)
Test$Taxa <- as.character(Test$Taxa)
I can split these strings on the underscore into a list of vectors of unequal length:
NewDF <- strsplit(Test$Taxa, "_", fixed=TRUE)
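But I cannot figure out how to format this parsed output into a usable structure.
Each parsed piece has two components, a descriptor and a taxonomic level (i.e. Bacteria(superkingdom) is the descriptor Bacteria and the taxonomic level superkingdom).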
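To illustrate the two components, here is a minimal sketch that separates a single piece with base R `sub()`. The variable names `piece`, `descriptor`, and `level` are only for this example, and it assumes every piece ends with exactly one parenthesised level.

```r
piece <- "Bacteria(superkingdom)"

# Everything before the first opening parenthesis is the descriptor
descriptor <- sub("\\(.*$", "", piece)

# The text inside the trailing parentheses is the taxonomic level
level <- sub("^.*\\(([^()]+)\\)$", "\\1", piece)
```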
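What I want to do is take this parsed output and populate a data frame with the following column headers (norank, superkingdom, phylum, class, order, family, genus, species, strain). The output needs to skip taxonomic levels that are not in that list (for example, some rows have a subclass level between class and order, and I need to drop subclass).
In addition, if a row stops at a particular taxonomic level and there are still unfilled columns, those should be set to NA (i.e. the first row ends at phylum, so class, order, family, etc. should be NA).
The final output should look like this: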
norank superkingdom phylum class order family genus species strain
1 cellularorganisms(norank) Bacteria(superkingdom) Actinobacteria(phylum) <NA> <NA> <NA> <NA> <NA> <NA>
2 cellularorganisms(norank) Bacteria(superkingdom) Proteobacteria(phylum) Gammaproteobacteria(class) Enterobacteriales(order) Enterobacteriaceae(family) <NA> <NA> <NA>
3 cellularorganisms(norank) Bacteria(superkingdom) Proteobacteria(phylum) Gammaproteobacteria(class) Enterobacteriales(order) <NA> <NA> <NA> <NA>
4 cellularorganisms(norank) Bacteria(superkingdom) Proteobacteria(phylum) Gammaproteobacteria(class) Enterobacteriales(order) Enterobacteriaceae(family) Klebsiella(genus) <NA> <NA>
5 cellularorganisms(norank) Bacteria(superkingdom) Proteobacteria(phylum) Gammaproteobacteria(class) Enterobacteriales(order) Enterobacteriaceae(family) Klebsiella(genus) Klebsiellapneumoniae(species) <NA>
6 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) Clostridiales(order) <NA> <NA> <NA> <NA>
7 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) <NA> <NA> <NA> <NA> <NA>
8 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) Clostridiales(order) Clostridiaceae(family) <NA> <NA> <NA>
9 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) Clostridiales(order) Clostridiaceae(family) Clostridium(genus) <NA> <NA>
10 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) Clostridiales(order) Clostridiaceae(family) Clostridium(genus) Clostridiumbotulinum(species) <NA>
11 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) Clostridiales(order) Clostridiaceae(family) Clostridium(genus) Clostridiumbotulinum(species) ClostridiumbotulinumCDC66177(strain)
12 cellularorganisms(norank) Bacteria(superkingdom) Actinobacteria(phylum) Actinobacteria(class) Actinomycetales(order) <NA> <NA> <NA> <NA>
13 cellularorganisms(norank) Bacteria(superkingdom) Actinobacteria(phylum) Actinobacteria(class) Actinomycetales(order) Microbacteriaceae(family) <NA> <NA> <NA>
14 cellularorganisms(norank) Bacteria(superkingdom) Actinobacteria(phylum) Actinobacteria(class) Actinomycetales(order) Microbacteriaceae(family) Microbacterium(genus) <NA> <NA>
15 cellularorganisms(norank) Bacteria(superkingdom) Actinobacteria(phylum) Actinobacteria(class) Actinomycetales(order) Microbacteriaceae(family) Microbacterium(genus) Microbacteriumlaevaniformans(species) MicrobacteriumlaevaniformansOR221(strain)
Answer 0 (score: 3)
You could try building a list of small data.frames and combining them into a single df:
library(dplyr)
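NewDF <-
  lapply(strsplit(Test$Taxa, "_", fixed = TRUE),
         function(x)
         {
           # pull the level label out of the parentheses in each piece
           vars <- lapply(x, function(y)
           {
             m <- regexec("\\((.+?)\\)", y)
             regmatches(y, m)[[1]][2]
           })
           # one-row data.frame whose columns are named by level
           vals <- as.list(x)
           names(vals) <- unlist(vars)
           data.frame(vals,
                      stringsAsFactors = FALSE)
         }) %>% bind_rows()  # bind_rows() replaces the deprecated rbind_all()
# keep only the requested levels (this drops subclass and suborder)
NewDF <- NewDF[, c("norank", "superkingdom", "phylum", "class", "order",
                   "family", "genus", "species", "strain")]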
It gives me the result you want (with nice variable names, too).
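If you would rather avoid the dplyr dependency, the same idea can be sketched in base R by looking up each wanted level with `match()`, which returns NA for levels a row does not have and ignores levels that were not asked for. This is only a sketch under stated assumptions: `ranks`, `taxa`, `rows`, and `result` are names made up for the example, `taxa` is a two-row stand-in for `Test$Taxa`, and every piece is assumed to end with exactly one parenthesised level.

```r
# Levels to keep, in column order (taken from the question)
ranks <- c("norank", "superkingdom", "phylum", "class", "order",
           "family", "genus", "species", "strain")

# Small stand-in for Test$Taxa (assumption: same format as the real data)
taxa <- c(
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)_Actinobacteria(class)_Actinobacteridae(subclass)_Actinomycetales(order)"
)

rows <- lapply(strsplit(taxa, "_", fixed = TRUE), function(pieces) {
  # Level label inside the trailing parentheses of each piece
  level <- sub("^.*\\(([^()]+)\\)$", "\\1", pieces)
  # Position of each wanted level in this row; absent levels index as NA
  pieces[match(ranks, level)]
})

# Stack the equal-length vectors into a character matrix, then a data.frame
result <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(result) <- ranks
```

To use it on the real data, replace `taxa` with `Test$Taxa`. Because every row is expanded to the same fixed set of columns, the column order follows `ranks` regardless of which levels happen to appear first in the input.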