I want to collapse the following data frame:
df
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87
chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89
chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78
chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92
chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04
chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91
chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02
chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92
I want to collapse rows, matching only consecutive rows that have the same Cn and mCn values. For example, for the first four rows we have:
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
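I want to collapse consecutive rows that have the same Cn and mCn scores. Here the first three rows each have "2" and "1" in the Cn and mCn columns respectively, so they become one row, and the End column is updated to reflect the collapse: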
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 9598904 8.023 0.026905119 -0.001671481 2 1 1.99
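But I also want to change the Cn_ column so that it becomes a weighted average of Cn_, where each row's influence depends on its lengthMB score. So for the first three rows, the calculation would be: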
((8.023/9.599) * 1.99) + ((1.145/9.599) * 2) + ((0.431/9.599) * 1.91) = 1.987
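The arithmetic above is a length-weighted mean, which base R provides as weighted.mean(). A minimal check using the three values from the first chr1 rows (the variable names are illustrative only, not from the question):

```r
# Cn_ values and segment lengths (in MB) of the first three chr1 rows
cn_values  <- c(1.99, 2, 1.91)
lengths_mb <- c(8.023, 1.145, 0.431)

# Manual form, as written out in the question
manual <- sum((lengths_mb / sum(lengths_mb)) * cn_values)

# Equivalent built-in form
builtin <- weighted.mean(cn_values, lengths_mb)

# Both evaluate to roughly 1.9876
print(round(c(manual = manual, builtin = builtin), 4))
```
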
Expected output for the first four unique chromosomes:
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 9598904 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9598904 1.145 0.030441784 0.000601976 2 1 1.79
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9598904 8.185 0.033499045 -0.031338811 1 0 0.836
chr3 9598904 35402000 0.883 0.049003807 -0.021413391 2 1 2.02
chr4 0 9598904 11.065 0.035092332 -0.022844471 2 1 2
I tried something like this, but I also don't know how to do the calculation...
squish_segments <- function(sample) {
setDT(sample)[, .ind:= cumsum(c(TRUE,Start[-1]!=End[-.N])),
list(lengthMB, probes, snps, imba, log2, Cn, mCn, Cn_)][,
list(Chr=Chromosome[1], Start=Start[1], End=End[.N]),
list(lengthMB, probes, snps, imba, log2, Cn, mCn, Cn_, .ind)][,.ind:=NULL][]
}
Answer 0 (score: 1)
First, please provide the dput output of your dataset to make your question more reproducible.
I think this is basically what you want.
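library(data.table)
setDT(df)  # setkey() needs a data.table, so convert the data frame by reference first
setkey(df, Chromosome, Cn, mCn, Start)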
df[, list(
Start=min(Start),
End=max(End),
lengthMB=lengthMB[1],
imba=imba[1],
log2=log2[1],
Cn_=weighted.mean(Cn_, lengthMB)
), keyby=list(Chromosome, Cn , mCn)]
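Grouping by Chromosome, Cn and mCn alone also merges rows that share those values but are not adjacent (for example the two separate Cn = 2, mCn = 1 stretches on chr2). A minimal sketch of one way to restrict the grouping to consecutive rows, assuming data.table's rleid() is available; the column subset and the names dt, run and res are illustrative, not part of the answer:

```r
library(data.table)

# chr1 and chr2 rows from the question, keeping only the columns needed here
dt <- read.table(text = "Chromosome Start End lengthMB Cn mCn Cn_
chr1 0 8022945 8.023 2 1 1.99
chr1 8022945 9168284 1.145 2 1 2
chr1 9168284 9598904 0.431 2 1 1.91
chr1 9598904 31392788 21.794 3 1 3.01
chr2 0 8022930 8.023 3 1 2.89
chr2 8022930 9168284 1.145 2 1 1.87
chr2 9168284 9598904 0.431 2 1 1.57
chr2 9598904 31392788 21.794 2 0 1.87
chr2 31392788 35402000 1.164 2 1 2.01",
                 header = TRUE, stringsAsFactors = FALSE)
setDT(dt)

# rleid() gives a new id each time any of Chromosome, Cn or mCn changes,
# so only consecutive identical rows end up in the same group
dt[, run := rleid(Chromosome, Cn, mCn)]

res <- dt[, .(Chromosome = Chromosome[1],
              Start      = min(Start),
              End        = max(End),
              lengthMB   = lengthMB[1],   # first value, as in the question; sum(lengthMB) is an alternative
              Cn         = Cn[1],
              mCn        = mCn[1],
              Cn_        = weighted.mean(Cn_, lengthMB)),
          by = run]
res[, run := NULL]
print(res)
```

With this subset the two chr2 stretches with Cn = 2 and mCn = 1 stay separate, because the Cn = 2, mCn = 0 row between them starts a new run.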
Answer 1 (score: 1)
Here is a dplyr approach.
library(dplyr)
df = read.table(text=
"Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87
chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89
chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78
chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92
chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04
chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91
chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02
chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92", header=T)
df %>%
mutate(Consec = ifelse(Chromosome == dplyr::lag(Chromosome, default = Chromosome[1]) & ## flag consecutive matching chromosomes
Cn == dplyr::lag(Cn, default = Cn[1]) &
mCn == dplyr::lag(mCn, default = mCn[1]), 0, 1),
Consec = cumsum(Consec)) %>% ## create an id for consecutive matching chromosomes
group_by(Chromosome, Cn, mCn, Consec) %>%
summarize(Cn_ = sum(lengthMB * Cn_)/sum(lengthMB),
Start = min(Start),
End = max(End),
lengthMB = first(lengthMB),
imba= first(imba),
log2= first(log2)) %>%
ungroup() %>% ## only if you want to ungroup
select(Chromosome,Start,End, lengthMB,imba,log2,Cn,mCn,Cn_) %>% ## to re arrange column order
arrange(Chromosome, Start)
# Chromosome Start End lengthMB imba log2 Cn mCn Cn_
# (fctr) (int) (int) (dbl) (dbl) (dbl) (int) (int) (dbl)
# 1 chr1 0 9598904 8.023 0.02690512 -0.001671481 2 1 1.9876008
# 2 chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.0100000
# 3 chr2 0 8022930 8.023 0.02690512 -0.001671481 3 1 2.8900000
# 4 chr2 8022930 9598904 1.145 0.03044178 0.000601976 2 1 1.7879569
# 5 chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.8700000
# 6 chr2 31392788 35402000 1.164 0.02973377 0.003149921 2 1 2.0100000
# 7 chr3 0 8040000 1.479 NA 0.000969256 2 1 2.0000000
# 8 chr3 8040000 9598904 8.185 0.03349904 -0.031338811 1 0 0.8541823
# 9 chr3 9598904 35402000 0.883 0.04900381 -0.021413391 2 1 2.0187143
# 10 chr4 0 9598904 11.065 0.03509233 -0.022844471 2 1 1.9956599
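Note that lag is a dplyr function, but it is also a function in the stats package. I had to write dplyr::lag, otherwise there was a conflict when I tried to specify default = in lag. I don't know whether you or anyone else can reproduce this problem.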
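To illustrate why the explicit namespace matters, here is a small sketch comparing the two functions on a toy vector (x is a made-up example, not data from the question). Which lag() an unqualified call resolves to depends on package load order, so qualifying the call is the safer habit:

```r
library(dplyr)

x <- c(2, 3, 3)

# dplyr::lag() shifts the values by one position and fills the gap with `default`
shifted <- dplyr::lag(x, default = x[1])
print(shifted)

# stats::lag() is meant for time series: it shifts the time base
# (the tsp attribute) and leaves the values themselves in place
unshifted <- as.numeric(stats::lag(x))
print(unshifted)
```
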
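Answer 2 (score: 0)
You can identify unique "events" (consecutive rows with the same Cn and mCn scores), then simply loop over those events and modify the rows accordingly. Not the most efficient approach, but it should do the job.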
txt <- "Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87
chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89
chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78
chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92
chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04
chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91
chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02
chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92"
df <- read.table(text=txt, header=T)
#identify each unique event
df$eventid <- with(df, cumsum(c(1,diff(as.numeric(factor(Chromosome)))!=0 | diff(Cn)!=0 | diff(mCn)!=0)))
#loop through events
for(i in 1:max(df$eventid)){
#identify rows in df with ith event
rows.i <- which(df$eventid == i)
df[rows.i,] <- within(df[rows.i,],{
#calculate values of interest and assign to first row of event
Start[1] <- min(Start)
End[1] <- max(End)
Cn_[1] <- sum((lengthMB/sum(lengthMB))*Cn_)
lengthMB[1] <- sum(lengthMB)
})
#drop all but first row
if(length(rows.i) > 1) df <- df[-rows.i[-1],]
} #end i
Result:
> df
Chromosome Start End lengthMB imba log2 Cn mCn Cn_ eventid
1 chr1 8022945 9598904 1.576 0.03044178 0.000601976 2 1 1.9753871 1
3 chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.0100000 2
4 chr2 0 8022930 8.023 0.02690512 -0.001671481 3 1 2.8900000 3
5 chr2 8022930 9598904 1.576 0.03044178 0.000601976 2 1 1.7879569 4
7 chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.8700000 5
8 chr2 31392788 35402000 1.164 0.02973377 0.003149921 2 1 2.0100000 6
9 chr3 0 8040000 1.479 NA 0.000969256 2 1 2.0000000 7
10 chr3 8040000 9598904 12.137 0.03349904 -0.031338811 1 0 0.8541823 8
12 chr3 9598904 35402000 4.978 0.04900381 -0.021413391 2 1 2.0187143 9
14 chr4 0 9598904 52.245 0.03509233 -0.022844471 2 1 1.9956599 10
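The event id is the core of this approach: a new id starts whenever the chromosome, Cn or mCn differs from the previous row. A small sketch on made-up toy vectors (chr, Cn and mCn below are illustrative, not the question's data) shows how the cumulative sum of change flags produces it:

```r
# Toy input: six rows across two chromosomes
chr <- c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2")
Cn  <- c(2, 2, 2, 3, 3, 2)
mCn <- c(1, 1, 1, 1, 1, 1)

# TRUE wherever a row differs from the one before it; the first row always starts an event
change <- c(TRUE,
            chr[-1] != chr[-length(chr)] | diff(Cn) != 0 | diff(mCn) != 0)

# Running count of changes = event id
eventid <- cumsum(change)
print(eventid)  # 1 1 1 2 3 4
```

Comparing the chromosome strings directly avoids the factor-to-numeric conversion used in the answer, but the idea is the same.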
Answer 3 (score: 0)
If I understand your question correctly, you can do this in one line with data.table's fast grouping.
library(data.table)
dt[, Cn_dependent := sum((lengthMB/sum(lengthMB)) * Cn_),
by = .(Chromosome, Cn, mCn)]
This gives:
> dt
Chromosome Start End lengthMB imba log2 Cn mCn Cn_ Cn_dependent
1: chr1 0 8022945 8.023 0.02690512 -0.001671481 2 1 1.99 1.987601
2: chr1 8022945 9168284 1.145 0.03044178 0.000601976 2 1 2.00 1.987601
3: chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91 1.987601
4: chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.01 3.010000
5: chr2 0 8022930 8.023 0.02690512 -0.001671481 3 1 2.89 2.890000
6: chr2 8022930 9168284 1.145 0.03044178 0.000601976 2 1 1.87 1.882285
7: chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57 1.882285
8: chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.87 1.870000
9: chr2 31392788 35402000 1.164 0.02973377 0.003149921 2 1 2.01 1.882285
To collapse by Chromosome, Cn and mCn, you can use a key and unique.
> setkey(dt, "Chromosome", "Cn", "mCn")
> unique(dt)
Chromosome Start End lengthMB imba log2 Cn mCn Cn_ Cn_dependent
1: chr1 0 8022945 8.023 0.02690512 -0.001671481 2 1 1.99 1.987601
2: chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.01 3.010000
3: chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.87 1.870000
4: chr2 8022930 9168284 1.145 0.03044178 0.000601976 2 1 1.87 1.882285
5: chr2 0 8022930 8.023 0.02690512 -0.001671481 3 1 2.89 2.890000
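Two caveats about this step, shown as a hedged sketch. First, as far as I know, newer data.table releases (1.9.8 and later) make unique() compare all columns by default rather than the key, so the key has to be passed explicitly through by. Second, grouping by value does not check that rows are adjacent, which is why the two separate Cn = 2, mCn = 1 stretches on chr2 share one Cn_dependent value (1.882285) in the output above. The small table below is made up for illustration:

```r
library(data.table)

# Illustrative table: the two chr1 rows share the same key values
dt <- data.table(Chromosome = c("chr1", "chr1", "chr2"),
                 Cn         = c(2L, 2L, 2L),
                 mCn        = c(1L, 1L, 1L),
                 Cn_        = c(1.99, 2.00, 1.87))

setkey(dt, Chromosome, Cn, mCn)

# Pass the key columns explicitly so only they are used to detect duplicates
collapsed <- unique(dt, by = key(dt))
print(collapsed)
```
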
Here is the dput of the data.table I started with: