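I have a data frame with 9 columns. The ninth column contains several fields mixed together in one string, and I want to split them apart: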
gene_id ENSG00000243485.3; gene_type lincRNA; gene_status KNOWN; gene_name RP11-34P13.3; level 2; tag ncRNA_host; havana_gene OTTHUMG00000000959.2;
so that it looks like this:
gene_id gene_type gene_status gene_name havana_gene
ENSG00000243485.3 lincRNA KNOWN RP11-34P13.3 OTTHUMG00000000959.2
In other words, I want to split each row on the semicolon delimiter.
Can anyone suggest the best way to do this? I tried
strsplit(lncRNA.gene$V9,';',fixed=TRUE)
but I got the error
Error in strsplit(lncRNA.gene$V9, ";", fixed = TRUE) : non-character argument
Answer 0 (score: 1)
Assuming your data.frame looks something like:
mydf <- data.frame(id = 1, V9 = "gene_id ENSG00000243485.3; gene_type lincRNA; gene_status KNOWN; gene_name RP11-34P13.3; level 2; tag ncRNA_host; havana_gene OTTHUMG00000000959.2;")
(except that this one has only two columns), you can try something like:
library(splitstackshape)
library(magrittr)
library(data.table) # provides dcast(); attached explicitly in case splitstackshape does not attach it
mydf %>%
  cSplit("V9", ";", "long") %>%          # First, split at the semicolon
  cSplit("V9", " ") %>%                  # Then, split on a space
  dcast(... ~ V9_1, value.var = "V9_2")  # Finally, make the data wide
#    id           gene_id    gene_name gene_status gene_type          havana_gene level        tag
# 1:  1 ENSG00000243485.3 RP11-34P13.3       KNOWN   lincRNA OTTHUMG00000000959.2     2 ncRNA_host
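However, if "V9_1" contains duplicates within a row after the second split, dcast falls back to tabulating (counting) the values instead of keeping them. In that case, read the help file at ?getanID, which should help in this situation.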
Answer 1 (score: 0)
Here is a solution that does not use any external libraries:
mydf <- data.frame(id = 1, V9 = "gene_id ENSG00000243485.3; gene_type lincRNA; gene_status KNOWN; gene_name RP11-34P13.3; level 2; tag ncRNA_host; havana_gene OTTHUMG00000000959.2;")
mydf <- rbind(mydf,mydf) # rbind the mydf with mydf for 2 observations
#Split the V9 by semicolon
splitBySemiColon <- strsplit(as.character(mydf$V9),';')
# Removal of whitespaces at start and end of the string
splitBySemiColon <- lapply(splitBySemiColon , function(x) trimws(x,which = c('both')))
#Split by space to fetch the values and column names separately
splitBySpace <- lapply(splitBySemiColon ,function(x) strsplit(x,' '))
# Extraction of column names
colnames <- sapply(splitBySpace[[1]], `[[`, 1)
# data.frame of the values
df <- do.call(rbind,lapply(splitBySpace, function(x) sapply(x, `[[`, 2)))
colnames(df) <- colnames
# Column bind with the other columns
df <- cbind(mydf[,1:2],df)
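Note: every row of V9 must have the same number of fields, and in the same order, because the column names are taken from the first row only.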
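If the rows may differ in field order or lack some fields, one way to relax that restriction is to match values by key rather than by position. The following is only a minimal base R sketch under some assumptions: parse_attributes is a hypothetical helper name (not from any package), the two-row mydf is made-up example data based on the question's string, values are assumed to be unquoted and separated from their key by a space, and when a key repeats within a row only its first value is kept.

```r
# Hypothetical helper: turn one attribute string into a named character
# vector, so values can be looked up by key instead of by position.
parse_attributes <- function(s) {
  fields <- trimws(strsplit(as.character(s), ";", fixed = TRUE)[[1]])
  fields <- fields[nzchar(fields)]        # guard against empty pieces
  keys   <- sub(" .*$", "", fields)       # text before the first space
  values <- sub("^[^ ]+ +", "", fields)   # text after the first space
  setNames(values, keys)
}

# Made-up example: the second row has a different order and no gene_type
mydf <- data.frame(
  id = 1:2,
  V9 = c("gene_id ENSG00000243485.3; gene_type lincRNA; gene_name RP11-34P13.3;",
         "gene_name RP11-34P13.3; gene_id ENSG00000243485.3;"),
  stringsAsFactors = FALSE
)

parsed   <- lapply(mydf$V9, parse_attributes)
all_keys <- unique(unlist(lapply(parsed, names)))

# Look every key up by name; a key that is absent from a row becomes NA
wide <- do.call(rbind, lapply(parsed, function(p) unname(p[all_keys])))
colnames(wide) <- all_keys

result <- cbind(mydf["id"], as.data.frame(wide, stringsAsFactors = FALSE))
```

Here result has one column per key seen in any row, and gene_type is NA for the second row. The as.character() call also covers the case where V9 was read in as a factor.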
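Answer 2 (score: 0)
I think using strsplit is risky, because the file format specification does not require gene_type, gene_status, etc. to appear in the same order on every line. You would be better off using a dedicated library designed to parse this kind of data: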
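# Install rtracklayer from Bioconductor. biocLite() is the older installer;
# on current Bioconductor releases use BiocManager::install("rtracklayer") instead.
source("http://bioconductor.org/biocLite.R")
biocLite("rtracklayer")
library(rtracklayer)
x <- import("data.gff")
x$gene_name # RP11-34P13.3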