My data is grouped by the ID in V6 and sorted by position (V1:V3):
dt
V1 V2 V3 V4 V5 V6
1: chr1 3054233 3054733 . + ENSMUSG00000090025
2: chr1 3102016 3102125 . + ENSMUSG00000064842
3: chr1 3205901 3207317 . - ENSMUSG00000051951
4: chr1 3206523 3207317 . - ENSMUSG00000051951
5: chr1 3213439 3215632 . - ENSMUSG00000051951
6: chr1 3213609 3216344 . - ENSMUSG00000051951
7: chr1 3214482 3216968 . - ENSMUSG00000051951
8: chr1 3421702 3421901 . - ENSMUSG00000051951
9: chr1 3466587 3466687 . + ENSMUSG00000089699
10: chr1 3513405 3513553 . + ENSMUSG00000089699
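What I want to do is add an extra column holding a positional index: within each V6 group, the first element gets "1", the second gets "2", and so on (counted in reverse for groups on the "-" strand). I can do this with ddply and a custom function: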
library(plyr)

# Number rows 1..n for "+" strand groups and n..1 for "-" strand groups
rankExons <- function(x) {
  if (unique(x$V5) == "+") {
    x$index <- seq_len(nrow(x))
  } else {
    x$index <- rev(seq_len(nrow(x)))
  }
  x
}
indexed <- ddply(dt, .(V6), rankExons)
indexed
V1 V2 V3 V4 V5 V6 index
1 chr1 3205901 3207317 . - ENSMUSG00000051951 6
2 chr1 3206523 3207317 . - ENSMUSG00000051951 5
3 chr1 3213439 3215632 . - ENSMUSG00000051951 4
4 chr1 3213609 3216344 . - ENSMUSG00000051951 3
5 chr1 3214482 3216968 . - ENSMUSG00000051951 2
6 chr1 3421702 3421901 . - ENSMUSG00000051951 1
7 chr1 3102016 3102125 . + ENSMUSG00000064842 1
8 chr1 3466587 3466687 . + ENSMUSG00000089699 1
9 chr1 3513405 3513553 . + ENSMUSG00000089699 2
10 chr1 3054233 3054733 . + ENSMUSG00000090025 1
Unfortunately, it is very slow on the full dataset (~620k rows), and it crashes and burns when run in parallel:
library(doMC)
registerDoMC(cores=6)
indexed <- ddply(dt, .(V6), rankExons, .parallel=TRUE)
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Warning message:
In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :
all scheduled cores encountered errors in user code
So I turned to data.table, but couldn't get it to work. Here is what I tried:
setkey(dt, "V6")
dt[,index:=rankExons(dt), by=V6]
dt[,rankExons(.sd), by=V6, .SDcols=c("V5, V6")]
Both attempts fail. How can I reproduce the ddply result with data.table?
dput(dt)
structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1"), V2 = c(3054233L, 3102016L,
3205901L, 3206523L, 3213439L, 3213609L, 3214482L, 3421702L, 3466587L,
3513405L), V3 = c(3054733L, 3102125L, 3207317L, 3207317L, 3215632L,
3216344L, 3216968L, 3421901L, 3466687L, 3513553L), V4 = c(".",
".", ".", ".", ".", ".", ".", ".", ".", "."), V5 = c("+", "+",
"-", "-", "-", "-", "-", "-", "+", "+"), V6 = c("ENSMUSG00000090025",
"ENSMUSG00000064842", "ENSMUSG00000051951", "ENSMUSG00000051951",
"ENSMUSG00000051951", "ENSMUSG00000051951", "ENSMUSG00000051951",
"ENSMUSG00000051951", "ENSMUSG00000089699", "ENSMUSG00000089699"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6"), class = c("data.table",
"data.frame"), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x1de6a88>)
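Answer 0 (score: 18)
As a bioinformatician, I run into this kind of operation all the time. This is where I adore data.table's ability to modify a subset of rows in place!
I'd do it like this: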
dt[V5 == "+", index := 1:.N, by=V6]
dt[V5 == "-", index := .N:1, by=V6]
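No function needed. This is also more efficient, because it avoids checking for == "+" or "-" once per group. Instead, you first subset to all rows where V5 is "+", then group by V6, and modify just those rows in place. You then do the same once more for "-". Hope that helps.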
Note: .N is a special variable that holds the number of observations in each group.
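To make the steps above concrete, here is a minimal, self-contained sketch on a trimmed copy of the question's sample data (only V2, V5 and V6 are kept). The `index2` column is not part of the answer: it is an assumed single-call alternative, added only to show that it yields the same numbering.

```r
library(data.table)

# Trimmed copy of the question's sample data: start position, strand, gene ID
dt <- data.table(
  V2 = c(3054233L, 3102016L, 3205901L, 3206523L, 3213439L,
         3213609L, 3214482L, 3421702L, 3466587L, 3513405L),
  V5 = c("+", "+", "-", "-", "-", "-", "-", "-", "+", "+"),
  V6 = c("ENSMUSG00000090025", "ENSMUSG00000064842",
         rep("ENSMUSG00000051951", 6), rep("ENSMUSG00000089699", 2))
)

# Subset by strand, group by gene, and assign by reference (as in the answer)
dt[V5 == "+", index := seq_len(.N), by = V6]        # 1..N for "+" genes
dt[V5 == "-", index := rev(seq_len(.N)), by = V6]   # N..1 for "-" genes

# Assumed alternative, not from the answer: one grouped call that
# inspects the strand of the first row of each gene
dt[, index2 := if (V5[1L] == "+") seq_len(.N) else rev(seq_len(.N)), by = V6]

print(dt)
```

The two-call form does the strand comparison once over the whole column, whereas the single-call form evaluates an `if` per group, which is the per-group check the answer suggests avoiding.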
Answer 1 (score: 3)
First, I'll load your example data into R (the dput() output of a data.table can't currently be read back in directly):
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
V1 V2 V3 V4 V5 V6
1 chr1 3205901 3207317 . - ENSMUSG00000051951
2 chr1 3206523 3207317 . - ENSMUSG00000051951
3 chr1 3213439 3215632 . - ENSMUSG00000051951
4 chr1 3213609 3216344 . - ENSMUSG00000051951
5 chr1 3214482 3216968 . - ENSMUSG00000051951
6 chr1 3421702 3421901 . - ENSMUSG00000051951
7 chr1 3102016 3102125 . + ENSMUSG00000064842
8 chr1 3466587 3466687 . + ENSMUSG00000089699
9 chr1 3513405 3513553 . + ENSMUSG00000089699
10 chr1 3054233 3054733 . + ENSMUSG00000090025")
With dplyr, your problem can almost be solved elegantly:
library(dplyr)
df %>%
group_by(V6, V5) %>%
mutate(index = row_number(V2))
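(I'm assuming V2 is the variable you want to index by. I think it's better to be explicit than to rely on the order of the rows.)
But you want a different computation for different subsets, which isn't currently easy in dplyr. One approach is to split and then recombine: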
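# rbind_list() has been deprecated in favour of bind_rows();
# group_by(V6) makes the index restart within each gene
bind_rows(
  df %>% filter(V5 == "+") %>% group_by(V6) %>% mutate(index = row_number(V2)),
  df %>% filter(V5 == "-") %>% group_by(V6) %>% mutate(index = row_number(desc(V2)))
)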
But because this has to make two copies of the data, it will be relatively slow.
Another approach is to use if inside the mutate() call:
df %>%
group_by(V6, V5) %>%
mutate(index = row_number(if (V5[1] == "+") V2 else desc(V2)))
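As a quick check of the point about ranking on V2 explicitly, the sketch below shuffles a trimmed copy of the sample data before indexing. The shuffle, the seed, and grouping by V6 alone are assumptions made for illustration; the `if` inside `mutate()` is the same idea as above.

```r
library(dplyr)

# Trimmed copy of the question's sample data: start position, strand, gene ID
df <- data.frame(
  V2 = c(3054233L, 3102016L, 3205901L, 3206523L, 3213439L,
         3213609L, 3214482L, 3421702L, 3466587L, 3513405L),
  V5 = c("+", "+", "-", "-", "-", "-", "-", "-", "+", "+"),
  V6 = c("ENSMUSG00000090025", "ENSMUSG00000064842",
         rep("ENSMUSG00000051951", 6), rep("ENSMUSG00000089699", 2)),
  stringsAsFactors = FALSE
)

# Shuffle the rows to show the result does not depend on input order
set.seed(1)
shuffled <- df[sample(nrow(df)), ]

indexed <- shuffled %>%
  group_by(V6) %>%
  # Rank ascending by V2 on the "+" strand, descending on the "-" strand
  mutate(index = if (V5[1] == "+") row_number(V2) else row_number(desc(V2))) %>%
  ungroup() %>%
  arrange(V2)   # restore positional order for display

print(indexed)
```

Because the rank is computed from V2 rather than from row position, the shuffled input yields the same index for every exon as the sorted input would.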