Arranging chromosome numbers in a data frame

Time: 2019-01-04 07:20:21

Tags: r

I have a file containing the chromosomes of each sample and their frequencies:

 a
 sample   Chr_No   frequency
 sample-1  chr1:         0
 sample-1  chr2:         0
 sample-1  chr3:         0
 sample-1  chr4:         1
 sample-1  chr5:         0
 sample-1  chr6:         0
 sample-1  chr7:         0
 sample-1  chr8:         0
 sample-1  chr9:         1
 sample-1  chr10         0
 sample-1  chr11         0
 ......

I want to convert it to a wide data frame, so I am using this in R:

 b <- dcast( a, sample ~ Chr_No, value.var = "frequency", fill = 0 )

This command creates the data frame, but the chromosome columns do not come out in the order I expect.

How can I remove the ":" from Chr_No and arrange the Chr_No columns as chr1 chr2 chr3 ... in the data frame?

2 answers:

Answer 0: (score: 1)

First remove the colon from the names, then use mixedsort to arrange the names as chr1, chr2, and so on:

library(gtools)

names(b) <- sub(":", "", names(b))
cbind(b[1], b[-1][mixedsort(names(b[-1]))])


#    sample chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11
#1 sample-1    0    0    0    1    0    0    0    0    1     0     0
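To see what each of the two steps does on its own, here is a minimal sketch. The vector raw_names is made up for illustration and only stands in for the column names dcast might produce; gtools is treated as optional.

```r
# Hypothetical names as dcast might produce them from the raw Chr_No values
raw_names <- c("chr1:", "chr10", "chr11", "chr2:", "chr3:", "chr9:")

# Step 1: drop the colon (sub() replaces the first match in each string)
clean_names <- sub(":", "", raw_names)

# Plain character sorting compares text position by position,
# so "chr10" and "chr11" land before "chr2"
alphabetical <- sort(clean_names, method = "radix")
print(alphabetical)

# Step 2: a natural sort treats the digits as a number.
# gtools is optional here; this part is skipped if it is not installed.
if (requireNamespace("gtools", quietly = TRUE)) {
  print(gtools::mixedsort(clean_names))
}
```

The alphabetical result is the kind of ordering the question describes, which is why a natural sort such as mixedsort is needed on top of removing the colon.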

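Or we can keep everything in base R: after removing the colon, strip all alphabetic characters from the names so that only the numbers remain, and order the columns by those numbers: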

cbind(b[1], b[-1][order(as.numeric(gsub("[[:alpha:]]", "", names(b[-1]))))])


#    sample chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11
#1 sample-1    0    0    0    1    0    0    0    0    1     0     0
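One caveat with the numeric approach: if the data also contained non-numeric names such as chrX or chrY (an assumption, they do not appear in the question), as.numeric would return NA for them. A possible workaround is sketched below; chr_order is a hypothetical helper, not part of either answer.

```r
# Hypothetical helper: numbered chromosomes first in numeric order,
# then any remaining names (e.g. chrX, chrY) in alphabetical order
chr_order <- function(nms) {
  suffix <- sub("^chr", "", nms)
  is_num <- grepl("^[0-9]+$", suffix)
  num <- rep(Inf, length(nms))            # non-numeric suffixes sort last
  num[is_num] <- as.numeric(suffix[is_num])
  order(num, suffix, method = "radix")    # ties among Inf are broken by suffix
}

nms <- c("chr10", "chrX", "chr2", "chr1", "chrY", "chr11")
ordered_names <- nms[chr_order(nms)]
print(ordered_names)
```

It could then be dropped into the same pattern as above, for example cbind(b[1], b[-1][chr_order(names(b[-1]))]), once the colons have been removed from names(b).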

Answer 1: (score: 0)

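Another option, before the dcast, is to convert the Chr_No column to a factor whose levels are in the desired order, after removing the ":" at the end of each string in Chr_No:

library(data.table)
setDT(a)[, Chr_No := factor(sub(':$', '', Chr_No), levels = paste0("chr", 1:11))]

Then do the dcast: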

dcast( a, sample ~ Chr_No, value.var = "frequency", fill = 0 )
#     sample chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11
#1: sample-1    0    0    0    1    0    0    0    0    1     0     0
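The idea that the wide columns follow the factor levels can be checked without any packages. Below is a minimal sketch, assuming a shortened, made-up version of the question's data and using base R's xtabs() as a stand-in for dcast (this is not what the answer itself uses).

```r
# Hypothetical long-format data mirroring the question (only a few chromosomes)
a <- data.frame(
  sample    = "sample-1",
  Chr_No    = c("chr1:", "chr2:", "chr4:", "chr9:", "chr10", "chr11"),
  frequency = c(0, 0, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Remove a trailing colon, then fix the level order explicitly
clean  <- sub(":$", "", a$Chr_No)
wanted <- paste0("chr", c(1, 2, 4, 9, 10, 11))
a$Chr_No <- factor(clean, levels = wanted)

# xtabs() is a base-R stand-in for dcast here; its columns follow the factor levels
wide <- xtabs(frequency ~ sample + Chr_No, data = a)
print(colnames(wide))
```

Because the order is carried by the factor levels, no column reordering is needed after reshaping; the trade-off is that the levels have to be listed up front, as the answer does with paste0("chr", 1:11).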