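I have relatively tidy data with sample, gene, allele and frequency in separate columns. For each gene within each sample, I need to spread the alleles and their corresponding frequencies into separate columns. Below is what I have and what I need.
I'm trying to do this with dplyr / tidyr, but I'll take any solution I can get.
What I have: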
data.frame(sample=rep("sample1", 10),
gene=rep(paste0("gene", 1:5), each=2),
allele=c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"),
freq=c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))
# sample gene allele freq
# 1 sample1 gene1 A 0.9
# 2 sample1 gene1 G 0.1
# 3 sample1 gene2 A 0.8
# 4 sample1 gene2 C 0.2
# 5 sample1 gene3 A 0.7
# 6 sample1 gene3 T 0.3
# 7 sample1 gene4 C 0.6
# 8 sample1 gene4 G 0.4
# 9 sample1 gene5 G 0.5
# 10 sample1 gene5 T 0.5
What I want:
data.frame(sample=rep("sample1", 5),
gene=paste0("gene", 1:5),
allele1=c("A", "A", "A", "C", "G"),
allele2=c("G", "C", "T", "G", "T"),
freq1=c(.9, .8, .7, .6, .5),
freq2=c(.1, .2, .3, .4, .5))
# sample gene allele1 allele2 freq1 freq2
# 1 sample1 gene1 A G 0.9 0.1
# 2 sample1 gene2 A C 0.8 0.2
# 3 sample1 gene3 A T 0.7 0.3
# 4 sample1 gene4 C G 0.6 0.4
# 5 sample1 gene5 G T 0.5 0.5
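Answer 0 (score: 7)
You can use dcast from the devel version of data.table, i.e. v1.9.5+, which can take multiple value.var columns. We create a sequence column ('indx') grouped by 'sample' and 'gene', then dcast from long to wide format, naming both value.var columns.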
library(data.table)#v1.9.5+
setDT(df)[, indx:=1:.N,.(sample, gene)]
dcast(df, sample+gene~indx, value.var=c('allele', 'freq'), sep= '')
# sample gene allele1 allele2 freq1 freq2
#1: sample1 gene1 A G 0.9 0.1
#2: sample1 gene2 A C 0.8 0.2
#3: sample1 gene3 A T 0.7 0.3
#4: sample1 gene4 C G 0.6 0.4
#5: sample1 gene5 G T 0.5 0.5
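Note: Instructions for installing the devel version are here.
The sep='' argument can be used to create the column names 'allele1', 'allele2', etc. instead of the default 'allele_1', 'allele_2', etc. (from @Arun's comment).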
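As a possible variant, the index column can be computed inside the formula instead of being added to the data. This is only a sketch: it assumes data.table v1.9.8 or later (where rowid() is available), and it rebuilds the question's data under the assumed name df.

```r
library(data.table)  # assumes v1.9.8+ so that rowid() is available

# Rebuild the question's example data (the name `df` is an assumption)
df <- data.table(sample = rep("sample1", 10),
                 gene   = rep(paste0("gene", 1:5), each = 2),
                 allele = c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"),
                 freq   = c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))

# rowid(sample, gene) numbers the rows within each sample/gene group,
# playing the role of the 'indx' column without modifying df
wide <- dcast(df, sample + gene ~ rowid(sample, gene),
              value.var = c("allele", "freq"), sep = "")
print(wide)
```

This leaves df unchanged, which may be preferable when the original long table is still needed afterwards.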
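Answer 1 (score: 5)
This uses summarise rather than a true reshape, but it may fit the bill. Note that first() and last() only capture every allele when each sample/gene group has exactly two rows.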
library(dplyr)
foo <- data.frame(sample=rep("sample1", 10),
gene=rep(paste0("gene", 1:5), each=2),
allele=c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"),
freq=c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))
foo %>%
group_by(sample, gene) %>%
summarise(allele1 = first(allele), allele2 = last(allele),
freq1 = first(freq), freq2 = last(freq))
## Source: local data frame [5 x 6]
## Groups: sample
##
## sample gene allele1 allele2 freq1 freq2
## 1 sample1 gene1 A G 0.9 0.1
## 2 sample1 gene2 A C 0.8 0.2
## 3 sample1 gene3 A T 0.7 0.3
## 4 sample1 gene4 C G 0.6 0.4
## 5 sample1 gene5 G T 0.5 0.5
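Since the first()/last() approach silently drops rows when a group has more than two alleles, it may be worth checking the group sizes first. A minimal sketch, reusing the foo data from this answer (the name bad_groups is made up for illustration):

```r
library(dplyr)

# Same example data as in the answer above
foo <- data.frame(sample = rep("sample1", 10),
                  gene   = rep(paste0("gene", 1:5), each = 2),
                  allele = c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"),
                  freq   = c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))

# List any sample/gene groups that do not have exactly two rows;
# an empty result means first()/last() cover every allele
bad_groups <- foo %>%
  count(sample, gene) %>%
  filter(n != 2)
print(bad_groups)
```

If bad_groups is not empty, a real reshape is probably the safer route.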
Answer 2 (score: 1)
You can join the table on itself and then select the appropriate rows.
library(dplyr)
table <- data.frame(sample=rep("sample1", 10),
gene=rep(paste0("gene", 1:5), each=2),
allele=c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"),
freq=c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))
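# The self-join pairs every allele with every other allele of the same sample/gene.
# The filter keeps one row per pair: higher frequency first, ties broken alphabetically.
# as.character() is used because as.numeric() on a character column returns NA.
inner_join(table, table, by=c("sample","gene")) %>%
filter(allele.x != allele.y,
(freq.x > freq.y | (freq.x == freq.y & as.character(allele.x) < as.character(allele.y))))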
Answer 3 (score: 0)
If the data are sorted by the sample and gene columns, and each gene has exactly 2 rows, then we can try the following:
cbind( df[ seq(1,nrow(df),2), ],
df[ seq(2,nrow(df),2), -c(1,2) ] )
#output
# sample gene allele freq allele freq
# 1 sample1 gene1 A 0.9 G 0.1
# 3 sample1 gene2 A 0.8 C 0.2
# 5 sample1 gene3 A 0.7 T 0.3
# 7 sample1 gene4 C 0.6 G 0.4
# 9 sample1 gene5 G 0.5 T 0.5
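The output above has duplicated column names (two allele and two freq columns). A possible variant, sketched under the same assumption of exactly two rows per sample/gene, sorts the data first and names the columns explicitly. The helper names odd, even and wide are made up for illustration.

```r
# Rebuild the question's example data (the name `df` is an assumption)
df <- data.frame(sample = rep("sample1", 10),
                 gene   = rep(paste0("gene", 1:5), each = 2),
                 allele = c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"),
                 freq   = c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5),
                 stringsAsFactors = FALSE)

# Sort by sample and gene so that the two rows of each pair are adjacent
df <- df[order(df$sample, df$gene), ]

# Odd rows hold the first allele of each pair, even rows the second
odd  <- seq(1, nrow(df), by = 2)
even <- seq(2, nrow(df), by = 2)

wide <- data.frame(sample  = df$sample[odd],
                   gene    = df$gene[odd],
                   allele1 = df$allele[odd],
                   allele2 = df$allele[even],
                   freq1   = df$freq[odd],
                   freq2   = df$freq[even],
                   stringsAsFactors = FALSE)
print(wide)
```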
Answer 4 (score: 0)
I've been struggling with this myself. Classic reshape is arguably the clearest?
df <- data.frame(sample=rep("sample1", 10),
gene=rep(paste0("gene", 1:5), each=2),
allele=c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"),
freq=c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))
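# make a time index: row number (1, 2) within each sample/gene group
df$time <- ave(seq_len(nrow(df)), df$sample, df$gene, FUN = seq_along)
# reshape from long to wide; columns come out as allele.1, freq.1, allele.2, freq.2
reshape( df, idvar=c("sample", "gene"), v.names=c("allele", "freq"), timevar="time", direction="wide")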
But I'd really like to see a clean answer from the tidyverse side!
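For what it's worth, a tidyverse take could look like the sketch below. It assumes tidyr 1.0.0 or later (for pivot_wider()) and uses an indx helper column, numbered within each sample/gene group, purely for illustration.

```r
library(dplyr)
library(tidyr)  # assumes tidyr >= 1.0.0 so that pivot_wider() is available

df <- data.frame(sample = rep("sample1", 10),
                 gene   = rep(paste0("gene", 1:5), each = 2),
                 allele = c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"),
                 freq   = c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5),
                 stringsAsFactors = FALSE)

wide <- df %>%
  group_by(sample, gene) %>%
  mutate(indx = row_number()) %>%   # 1, 2 within each sample/gene group
  ungroup() %>%
  pivot_wider(names_from  = indx,
              values_from = c(allele, freq),
              names_sep   = "")     # allele1, allele2 rather than allele_1, allele_2
print(wide)
```

Unlike the first()/last() approach, this should also cope with groups that have more than two alleles, producing allele3, freq3 and so on, with NA where a group has fewer rows.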