Question

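This should be simple, but I can't figure out how to do it in R. Basically, I have a two-column file in which the first column is the scaffold ID and the second column is the position of a SNP within that scaffold.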

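I want to create a new column, NEW_POS. For the first scaffold, NEW_POS is the same as POS. For the second scaffold and every one after it, NEW_POS is each POS plus the last NEW_POS value of the previous scaffold, so the offset accumulates from scaffold to scaffold. For the second scaffold that gives 50 + 17, 50 + 23, 50 + 46, ...; for the third scaffold, 96 + 13, 96 + 19, 96 + 38, ...; and so on, as shown here: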

id    POS  NEW_POS
0001  38   38
0001  46   46
0001  50   50
0002  17   67
0002  23   73
0002  46   96
0003  13   109
0003  19   115
0003  38   134
...   ...  ...

Answer 1

Here is a solution using lag and cumsum:

library(dplyr)

df1 %>%
    mutate(
        scf.len = (id != lag(id, default = id[1])) * lag(POS, default = 0),
        New_POS = cumsum(scf.len) + POS
    )  %>%
    select(-scf.len)
#     id POS New_POS
# 1 0001  38      38
# 2 0001  46      46
# 3 0001  50      50
# 4 0002  17      67
# 5 0002  23      73
# 6 0002  46      96
# 7 0003  13     109
# 8 0003  19     115
# 9 0003  38     134

Data:

> dput(df1)
structure(list(id = c("0001", "0001", "0001", "0002", "0002", 
"0002", "0003", "0003", "0003"), POS = c(38, 46, 50, 17, 23, 
46, 13, 19, 38)), .Names = c("id", "POS"), class = "data.frame", row.names = c(NA, 
-9L))
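
For readers without dplyr, the same lag-and-cumsum idea can be sketched in base R. This is not part of the original answer: the names prev_id, prev_pos, offset_step and res are illustrative, and the sketch assumes the rows are already sorted by scaffold and position, as in the example data.

```r
# Example data from the question (same values as df1 above)
df1 <- data.frame(
  id  = rep(c("0001", "0002", "0003"), each = 3),
  POS = c(38, 46, 50, 17, 23, 46, 13, 19, 38),
  stringsAsFactors = FALSE
)

n <- nrow(df1)

# Manual "lag": the previous row's id and POS.
# The first row has no predecessor, so it gets its own id and a POS of 0.
prev_id  <- c(df1$id[1], df1$id[-n])
prev_pos <- c(0, df1$POS[-n])

# Non-zero only on the first row of a new scaffold,
# where it equals the last POS of the previous scaffold.
offset_step <- (df1$id != prev_id) * prev_pos

# The running total of those steps is the offset applied to every row.
res <- df1
res$New_POS <- cumsum(offset_step) + df1$POS
res
```

Keeping offset_step as a separate vector makes it easy to inspect where each scaffold boundary falls before the offsets are summed.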

Answer 2

Here is a data.table solution that also uses shift() and cumsum(), but in an update join:

library(data.table)
DT[DT[, last(POS), id][, .(id, shift(cumsum(V1), fill = 0))], on = "id", 
   NEW_POS := POS + V2][]

which returns

     id POS NEW_POS
1: 0001  38      38
2: 0001  46      46
3: 0001  50      50
4: 0002  17      67
5: 0002  23      73
6: 0002  46      96
7: 0003  13     109
8: 0003  19     115
9: 0003  38     134

Explanation

tmp <- DT[, last(POS), id][, .(id, shift(cumsum(V1), fill = 0))][]
tmp
#     id V2
#1: 0001  0
#2: 0002 50
#3: 0003 96

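This picks the last POS value of each id group, computes the cumulative sum of those values, and then shifts (lags) the result by one position, filling the first slot with 0.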
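
The arithmetic behind tmp can be checked by hand with plain vectors. The following is a small base R sketch, not taken from the original answer; last_pos, running and offset are illustrative names, and the values are copied from the example data.

```r
# Last POS of each scaffold, taken from the example data
last_pos <- c("0001" = 50, "0002" = 46, "0003" = 38)

# Running total of the scaffold ends: 50, 96, 134
running <- cumsum(last_pos)

# Shift by one and fill the first slot with 0, so that each scaffold
# receives the total of all scaffolds that come before it
offset <- c(0, head(unname(running), -1))
names(offset) <- names(last_pos)
offset
```

The resulting offsets (0, 50, 96) match the V2 column of tmp shown above.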

This result is then joined with the full data.table on id:

DT[tmp, on = "id", NEW_POS := POS + V2][]

thereby creating NEW_POS in place.

Data

DT <- structure(list(id = c("0001", "0001", "0001", "0002", "0002", 
"0002", "0003", "0003", "0003"), POS = c(38L, 46L, 50L, 17L, 
23L, 46L, 13L, 19L, 38L)), .Names = c("id", "POS"), row.names = c(NA, 
-9L), class = "data.frame")
#coerce to data.table
setDT(DT)
