I have a large data.frame with several columns, but the 9th column consists of semicolon-separated data, which I would like to split into separate columns:
gtf$V9
1 gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
2 gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
3 gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
4 gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
Later I want to merge the result with the rest of the data.frame (the columns before the 9th).
I tried some code without success:
head(gtf$V9, sep = ";",stringsAsFactors = FALSE)
or
new_df <- matrix(gtf$V9, ncol=7, byrow=TRUE) # sep = ";"
combined with as.data.frame, data.frame or as.matrix.
I also tried exporting with write.csv and re-importing with sep=";", but the data.frame is too large and my computer lags.
Any suggestions?
Answer 0 (score: 3)
Another option is to use the splitstackshape package (which also loads data.table). Using:
library(splitstackshape)
cSplit(cSplit(df, 'V9', sep = ';', direction = 'long'),
'V9', sep = ' ')[, dcast(.SD, cumsum(V9_1 == 'gene_id') ~ V9_1)]
gives:
   V9_1  conf_hi  conf_lo       cov exon_number         FPKM     frac gene_id transcript_id
1:    1 9.805420 4.347062 25.616962          NA 7.0762407256 1.000000  CUFF.1      CUFF.1.1
2:    2 9.805420 4.347062 25.616962           1 7.0762407256 1.000000  CUFF.1      CUFF.1.1
3:    3 9.805420 4.347062 25.616962           2 7.0762407256 1.000000  CUFF.1      CUFF.1.1
4:    4 9.805420 4.347062 25.616962           3 7.0762407256 1.000000  CUFF.1      CUFF.1.1
Answer 1 (score: 1)
You can do this with strsplit() inside sapply().
If you know how many objects V9 contains, you can loop over them with a for loop:
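for (i in seq_len(number_of_max_objects_in_V9)) {
  # append the i-th semicolon-separated field of every row as a new column
  # (the fields in V9 are separated by ';', not ','; rows with fewer fields get NA)
  gtf[ncol(gtf) + 1] <- sapply(seq_len(nrow(gtf)), function(x) trimws(strsplit(gtf$V9[x], ";")[[1]][i]))
}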
If you don't know how many objects V9 can hold, just run str_count over gtf$V9, like this:
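library(stringr)
# every field ends with ';', so the number of ';' in a row equals its number of fields;
# str_count is vectorised, so no sapply is needed
number_of_max_objects_in_V9 <- max(str_count(gtf$V9, ";"))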
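Note that a purely positional split places the i-th field of each row in the same column, so in the question's data the first row (which has no exon_number) would be shifted relative to the others. A minimal base R sketch of a key-based alternative follows; the vector v9, the helper parse_attributes and the result attr_df are hypothetical names used only for illustration, and the sketch assumes every field has the form "key value".

```r
# Hypothetical two-row example in the same format as gtf$V9.
v9 <- c(
  "gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256;",
  "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256;"
)

# Turn one attribute string into a named character vector (key -> value).
parse_attributes <- function(x) {
  fields <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  fields <- fields[nzchar(fields)]           # drop empty pieces, if any
  keys   <- sub(" .*$", "", fields)          # text before the first space
  values <- sub("^[^ ]+ +", "", fields)      # text after the first space
  setNames(values, keys)
}

parsed   <- lapply(v9, parse_attributes)
all_keys <- unique(unlist(lapply(parsed, names)))

# Index each row's vector by key name; keys absent from a row become NA.
attr_df <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) unname(p[all_keys]))),
  stringsAsFactors = FALSE
)
names(attr_df) <- all_keys
```

All resulting columns are character; they could be converted afterwards (for example with as.numeric) and bound to the remaining columns of gtf with cbind, assuming the row order is unchanged.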
Answer 2 (score: 1)
# example dataset (only variable of interest included)
df = data.frame(V9=c("gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                     "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                     "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                     "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;"),
                stringsAsFactors = FALSE)
library(dplyr)
library(tidyr)
df %>%
  mutate(id = row_number()) %>% # flag row ids (will need those to reshape data later)
  separate_rows(V9, sep="; ") %>% # split strings and create new rows
  separate(V9, c("name","value"), sep=" ") %>% # separate column name from value
  mutate(value = gsub(";","",value)) %>% # remove ; when necessary
  spread(name, value) # reshape data
# id conf_hi conf_lo cov exon_number FPKM frac gene_id transcript_id
# 1 1 9.805420 4.347062 25.616962 <NA> 7.0762407256 1.000000 CUFF.1 CUFF.1.1
# 2 2 9.805420 4.347062 25.616962 1 7.0762407256 1.000000 CUFF.1 CUFF.1.1
# 3 3 9.805420 4.347062 25.616962 2 7.0762407256 1.000000 CUFF.1 CUFF.1.1
# 4 4 9.805420 4.347062 25.616962 3 7.0762407256 1.000000 CUFF.1 CUFF.1.1
You can join this dataset back to the initial dataset using the row id (id). You will also need to create id in the original dataset.
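The join on the row id could look roughly like the sketch below. It uses base R's merge rather than a dplyr join, and both data frames (gtf with a single extra column V1, and parsed standing in for the reshaped output) are small hypothetical stand-ins, not the real data.

```r
# Hypothetical stand-in for the original data (only two columns shown).
gtf <- data.frame(
  V1 = c("chr1", "chr1"),
  V9 = c("gene_id CUFF.1; FPKM 7.0762407256;",
         "gene_id CUFF.1; exon_number 1; FPKM 7.0762407256;"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the reshaped result, keyed by the same row id.
parsed <- data.frame(
  id          = 1:2,
  exon_number = c(NA, "1"),
  FPKM        = c("7.0762407256", "7.0762407256"),
  gene_id     = c("CUFF.1", "CUFF.1"),
  stringsAsFactors = FALSE
)

# Create the same row id in the original data, drop V9, then join on id.
gtf$id <- seq_len(nrow(gtf))
joined <- merge(gtf[, setdiff(names(gtf), "V9")], parsed, by = "id")
```

The id must be created before any filtering or reordering of the original data, otherwise the row numbers will no longer line up.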