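I want to convert the columns below into the format shown further down. The idea is that samples are grouped between successive rows of Sample Type N. For example, the first two rows below form one group, and 7397-DNA_A01 through 7399-DNA_A01 form another.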
Sample Sample Type
7393.DNA_A01 N
7394-DNA_A01 T
7395-DNA_A01 N
7396-DNA_A01 T
7397-DNA_A01 N
7398-DNA_A01 T
7399-DNA_A01 LN
7400-DNA_A01 N
7401-DNA_A01 T
7402-DNA_A01 B
Desired output:
N             T             B             LN
7393.DNA_A01  7394-DNA_A01
7395-DNA_A01  7396-DNA_A01
7397-DNA_A01  7398-DNA_A01                7399-DNA_A01
7400-DNA_A01  7401-DNA_A01  7402-DNA_A01
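I'm really not sure how to split the rows whenever an N is encountered, and after that I think I need to transpose somehow. Please help!
Answer 0 (score: 1)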
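We need to create a grouping index ('indx') based on the occurrences of 'N'. Here, a logical vector (SampleType=='N') is created and passed to cumsum to build 'indx'. Because the column order of the result follows the values of 'SampleType', it may be useful to convert the 'SampleType' column to a factor and specify the levels in the order of the column names in the expected result. Then we can use dcast from either data.table or reshape2.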
library(data.table) # v1.9.5+
# Group index: increases by one at every 'N' row
setDT(df1)[, indx := cumsum(SampleType == 'N')
  ][, SampleType := factor(SampleType, levels = c('N', 'T', 'B', 'LN'))]
# Cast to wide format, then drop the helper 'indx' column
dcast(df1, indx ~ SampleType, value.var = 'Sample', fill = '')[, -1, with = FALSE]
#              N            T            B           LN
#1: 7393.DNA_A01 7394-DNA_A01
#2: 7395-DNA_A01 7396-DNA_A01
#3: 7397-DNA_A01 7398-DNA_A01              7399-DNA_A01
#4: 7400-DNA_A01 7401-DNA_A01 7402-DNA_A01
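To see why cumsum on the logical vector yields a usable group index, here is a minimal base R sketch using only the sample types from the question. The names sample_type and starts_group are illustrative and do not appear in the answer's code.

```r
# Sample types in their original row order (copied from the question)
sample_type <- c("N", "T", "N", "T", "N", "T", "LN", "N", "T", "B")

# TRUE wherever a new group starts, i.e. at every 'N'
starts_group <- sample_type == "N"

# cumsum() counts TRUE as 1 and FALSE as 0, so the running total
# increases by one at each 'N' and stays constant until the next one
indx <- cumsum(starts_group)
indx
# [1] 1 1 2 2 3 3 3 4 4 4
```

Rows that share a value of indx end up on the same row of the wide table.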
If you use dcast from reshape2, the 'indx' column can be created with a base R option. You can also convert the 'SampleType' column to a factor with code similar to that shown above.
df1$indx <- cumsum(df1$SampleType=='N')
library(reshape2)
dcast(df1, indx~SampleType, value.var='Sample', fill='')
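The factor conversion mentioned for the reshape2 route could look like the sketch below. It uses base R only and rebuilds df1 from the question's data so that it runs on its own. The table() call is just an illustration that tabulated columns follow the level order; it is not part of the answer.

```r
# Rebuild the example data (values copied from the question)
df1 <- data.frame(
  Sample = c("7393.DNA_A01", "7394-DNA_A01", "7395-DNA_A01", "7396-DNA_A01",
             "7397-DNA_A01", "7398-DNA_A01", "7399-DNA_A01", "7400-DNA_A01",
             "7401-DNA_A01", "7402-DNA_A01"),
  SampleType = c("N", "T", "N", "T", "N", "T", "LN", "N", "T", "B"),
  stringsAsFactors = FALSE
)

# Group index, as in the answer
df1$indx <- cumsum(df1$SampleType == "N")

# Fix the level order so it matches the desired column order
df1$SampleType <- factor(df1$SampleType, levels = c("N", "T", "B", "LN"))

# Columns of a cross-tabulation follow the level order, not the alphabet
type_order <- colnames(table(df1$indx, df1$SampleType))
type_order
```

With the factor in place, the reshape2 dcast call should return its columns in N, T, B, LN order rather than alphabetically.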
# Data used in the examples above
df1 <- structure(list(Sample = c("7393.DNA_A01", "7394-DNA_A01",
"7395-DNA_A01", "7396-DNA_A01", "7397-DNA_A01", "7398-DNA_A01",
"7399-DNA_A01", "7400-DNA_A01", "7401-DNA_A01", "7402-DNA_A01"),
SampleType = c("N", "T", "N", "T", "N", "T", "LN", "N", "T", "B")),
.Names = c("Sample", "SampleType"), class = "data.frame",
row.names = c(NA, -10L))
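If neither package is available, the same reshaping can be sketched in base R by filling an empty character matrix. This is an alternative to the dcast approach in the answer, not a description of what dcast does internally. The names type_levels and wide are illustrative, and the sketch assumes each group contains at most one sample per type, as in the example data.

```r
# Rebuild the example data (values copied from the question)
df1 <- data.frame(
  Sample = c("7393.DNA_A01", "7394-DNA_A01", "7395-DNA_A01", "7396-DNA_A01",
             "7397-DNA_A01", "7398-DNA_A01", "7399-DNA_A01", "7400-DNA_A01",
             "7401-DNA_A01", "7402-DNA_A01"),
  SampleType = c("N", "T", "N", "T", "N", "T", "LN", "N", "T", "B"),
  stringsAsFactors = FALSE
)

# Row of the wide table: group index that increases at every 'N'
df1$indx <- cumsum(df1$SampleType == "N")

# Column of the wide table: position of the type in the desired order
type_levels <- c("N", "T", "B", "LN")

# Start with an all-empty matrix, one row per group and one column per type
wide <- matrix("", nrow = max(df1$indx), ncol = length(type_levels),
               dimnames = list(NULL, type_levels))

# A two-column index matrix addresses one (row, column) cell per sample
wide[cbind(df1$indx, match(df1$SampleType, type_levels))] <- df1$Sample
wide
```

If a group could contain two samples of the same type, the later one would overwrite the earlier one in this sketch, so that case would need separate handling.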