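Split a row into columns in R

Time: 2015-11-09 09:48:30

Tags: r string split row

I have a data frame with 9 columns. The ninth column holds several values mixed together in one string, and I would like to separate them: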

gene_id ENSG00000243485.3; gene_type lincRNA; gene_status KNOWN; gene_name RP11-34P13.3; level 2; tag ncRNA_host; havana_gene OTTHUMG00000000959.2;

so that it looks like this:

 gene_id            gene_type gene_status  gene_name     havana_gene 
 ENSG00000243485.3  lincRNA   KNOWN        RP11-34P13.3  OTTHUMG00000000959.2

So I want to split the row on the semicolon delimiter.

Can anyone suggest the best way to do this? I tried

strsplit(lncRNA.gene$V9,';',fixed=TRUE)

but I get this error:

  Error in strsplit(lncRNA.gene$V9, ";", fixed = TRUE) : non-character argument

3 answers:

Answer 0 (score: 1)

Assuming your data.frame looks something like this:

mydf <- data.frame(id = 1, V9 = "gene_id ENSG00000243485.3; gene_type lincRNA; gene_status KNOWN; gene_name RP11-34P13.3; level 2; tag ncRNA_host; havana_gene OTTHUMG00000000959.2;")

(except that this one has only two columns), you could try something like:

library(splitstackshape)
library(data.table)   # provides dcast(); recent splitstackshape versions do not attach it
library(magrittr)

mydf %>%
  cSplit("V9", ";", "long") %>%          # First, split at the semicolon
  cSplit("V9", " ") %>%                  # Then, split on a space
  dcast(... ~ V9_1, value.var = "V9_2")  # Finally, make the data wide

#    id           gene_id    gene_name gene_status gene_type          havana_gene level        tag
# 1:  1 ENSG00000243485.3 RP11-34P13.3       KNOWN   lincRNA OTTHUMG00000000959.2     2 ncRNA_host

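However, if "V9_1" contains duplicates after the second split, dcast defaults to tabulating (counting the values) instead of filling them in. In that case, read the help file at ?getanID, which is meant for exactly this situation.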
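To illustrate the duplicate case, here is a minimal sketch. The long-format table is built by hand rather than with cSplit, and the second "tag" value ("basic") is invented purely to create a duplicate; the idea is that getanID adds a ".id" counter, which can then go on the left-hand side of the dcast formula.

```r
library(data.table)       # data.table(), dcast()
library(splitstackshape)  # getanID()

# Long-format data as it might look after both splits; the second "tag"
# value ("basic") is made up to create a duplicate key
long <- data.table(
  id   = 1,
  V9_1 = c("gene_id", "tag", "tag"),
  V9_2 = c("ENSG00000243485.3", "ncRNA_host", "basic")
)

# Add a ".id" counter within each id/V9_1 combination
long <- getanID(long, id.vars = c("id", "V9_1"))

# Put ".id" on the left-hand side so every cell holds exactly one value
wide <- dcast(long, id + .id ~ V9_1, value.var = "V9_2")
```

With this layout a repeated key produces an extra row rather than a count; keys that occur only once are NA in that extra row.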

Answer 1 (score: 0)

Here is a solution that does not use any external libraries:

mydf <- data.frame(id = 1, V9 = "gene_id ENSG00000243485.3; gene_type lincRNA; gene_status KNOWN; gene_name RP11-34P13.3; level 2; tag ncRNA_host; havana_gene OTTHUMG00000000959.2;")
mydf <- rbind(mydf,mydf)     # rbind the mydf with mydf for 2 observations

# Split V9 at each semicolon. as.character() matters when V9 is a factor:
# strsplit() on a factor raises the "non-character argument" error.
splitBySemiColon <- strsplit(as.character(mydf$V9), ";", fixed = TRUE)

# Remove whitespace at the start and end of each piece
splitBySemiColon <- lapply(splitBySemiColon, trimws)

# Split each "name value" piece on the space
splitBySpace <- lapply(splitBySemiColon, function(x) strsplit(x, " ", fixed = TRUE))

# Take the column names from the first row
col_names <- sapply(splitBySpace[[1]], `[[`, 1)

# Character matrix of the values, one row per observation
df <- do.call(rbind, lapply(splitBySpace, function(x) sapply(x, `[[`, 2)))
colnames(df) <- col_names

# Column-bind with the other columns
df <- cbind(mydf[, 1:2], df)

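Note: this assumes every V9 value has the same number of fields, in the same order, because the column names are taken from the first row only.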
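If that assumption does not hold, one possible variation is to match values by key instead of by position. The following is only a sketch in base R: parse_attributes is a hypothetical helper name, the two sample rows are made up (the second has its fields reordered and one missing), and keeping only the first value of a repeated key is an arbitrary choice.

```r
# Hypothetical helper: turn one "key value; key value;" string into a named vector
parse_attributes <- function(x) {
  fields <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  fields <- fields[nzchar(fields)]        # drop any empty pieces
  keys   <- sub(" .*$", "", fields)       # text before the first space
  values <- sub("^[^ ]+ +", "", fields)   # text after the first space
  keep   <- !duplicated(keys)             # assumption: keep the first value of a repeated key
  setNames(values[keep], keys[keep])
}

# Made-up sample: the second row has fewer fields, in a different order
mydf <- data.frame(
  id = 1:2,
  V9 = c("gene_id ENSG00000243485.3; gene_type lincRNA; gene_name RP11-34P13.3;",
         "gene_name RP11-34P13.3; gene_id ENSG00000243485.3;"),
  stringsAsFactors = TRUE   # mimic the factor column behind the strsplit error
)

parsed   <- lapply(as.character(mydf$V9), parse_attributes)
all_keys <- unique(unlist(lapply(parsed, names)))

# Look each key up by name; keys absent from a row come back as NA
wide <- do.call(rbind, lapply(parsed, function(p) unname(p[all_keys])))
colnames(wide) <- all_keys

result <- cbind(mydf["id"], as.data.frame(wide, stringsAsFactors = FALSE))
```

Here result has one column per key seen in any row, and the second row gets NA for gene_type.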

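Answer 2 (score: 0)

I think using strsplit is risky, because the file format specification does not require gene_type, gene_status, etc. to appear in the same order on every line. You are better off using a dedicated library designed to parse this kind of data: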

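# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("rtracklayer")
library(rtracklayer)
x <- import("data.gff")
x$gene_name # RP11-34P13.3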
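To get from the imported object to the table asked for in the question, something like the following might work. It is a sketch under assumptions: it writes a single illustrative GTF line to a temporary file (the coordinates and the source column are placeholders, and the attribute values are quoted as GTF expects), then pulls the attributes out of the metadata columns.

```r
library(rtracklayer)

# One illustrative GTF line; coordinates and source column are placeholders
attrs <- paste(
  'gene_id "ENSG00000243485.3";', 'gene_type "lincRNA";',
  'gene_status "KNOWN";', 'gene_name "RP11-34P13.3";', 'level 2;',
  'tag "ncRNA_host";', 'havana_gene "OTTHUMG00000000959.2";'
)
gtf_line <- paste("chr1", "HAVANA", "gene", 100, 200, ".", "+", ".", attrs, sep = "\t")

tmp <- tempfile(fileext = ".gtf")
writeLines(gtf_line, tmp)

x <- import(tmp, format = "gtf")

# The attributes are stored as metadata columns on the GRanges object
wanted <- c("gene_id", "gene_type", "gene_status", "gene_name", "havana_gene")
gene_table <- as.data.frame(mcols(x))[, wanted]
```

Because the columns are selected by name, the order of the attributes within each line of the file no longer matters.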