I am new to R. I have a data frame like the following:
df <- data.frame(
  S_id = c("s13261", "s13261", "s13082", "s13082", "s2936", "s2936",
           "s2999", "s2999", "s2999", "s2999", "s2999"),
  T_id = c("A_3", "BC_2", "CT_5", "G_32", "HU_8", "HU_9",
           "Pk_4", "Op_12", "WQ_54", "MN_23", "NB_1"),
  Start = c(17947, 18405, 87, 1220, 2982, 2982,
            13820, 32320, 38734, 38741, 44031),
  End = c(18363, 19966, 1259, 3433, 4597, 4073,
          15014, 33618, 40603, 40603, 44339),
  Plus_minus = c("-", "-", "+", "+", "+", "+", "-", "+", "-", "-", "+"),
  status = c("5pp", "3pp", "3pp", "5pp", "5pp", "5pp",
             "5pp", "5pp", "3pp", "3pp", "5pp")
)
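I want to go through the data frame and group the rows by S_id. Then, within each group, I want to compare every row with the row that follows it, and keep only the pairs where the difference between the second row's Start value and the previous row's End value is less than 100 and the statuses of the two rows are 5pp and 3pp, or 3pp and 5pp. My expected output is pasted below: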
S_id T_id Start End Plus_minus status
s13261 A_3 17947 18363 - 5pp
s13261 BC_2 18405 19966 - 3pp
s13082 CT_5 87 1259 + 3pp
s13082 G_32 1220 3433 + 5pp
Please guide me.
Answer 0 (score: 1)
You can use the excellent dplyr package.
library(dplyr)

df %>%
  group_by(S_id) %>%
  mutate(diff = Start - lag(End), diff_status = lag(status))
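To see what such helper columns contain, here is a minimal sketch on a made-up three-row data frame. The names toy, g, prev_End and gap are hypothetical and only serve the illustration; they are not part of the question's data.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(
  g     = c("a", "a", "b"),
  Start = c(10, 60, 5),
  End   = c(50, 90, 20)
)

toy_lagged <- toy %>%
  group_by(g) %>%
  # prev_End is the End of the previous row within the same group;
  # the first row of each group has no predecessor, so it gets NA
  mutate(prev_End = lag(End), gap = Start - prev_End) %>%
  ungroup()

print(toy_lagged)
# prev_End is NA, 50, NA and gap is NA, 10, NA
```

Because the lag is computed per group, the first row of group "b" does not pick up the End value from the last row of group "a".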
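The function lag returns the element that precedes the current one, so the first row of each group gets NA. Now the only thing left to do is to filter on the newly created columns and decide whether you want to keep the NA values and the negative values (i.e. overlaps):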
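df %>%
  group_by(S_id) %>%
  mutate(diff = Start - lag(End), diff_status = lag(status)) %>%
  # keep gaps below 100 (including overlaps) with a changed status,
  # plus the first row of each group, where both helper columns are NA
  filter(diff < 100 | is.na(diff), diff_status != status | is.na(diff_status)) %>%
  select(-diff, -diff_status)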
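The result for your example is as follows. Because the NA rows are kept, the first row of every group is retained, including HU_8 and Pk_4, which are not in your expected output: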
S_id T_id Start End Plus_minus status
<fctr> <fctr> <dbl> <dbl> <fctr> <fctr>
1 s13261 A_3 17947 18363 - 5pp
2 s13261 BC_2 18405 19966 - 3pp
3 s13082 CT_5 87 1259 + 3pp
4 s13082 G_32 1220 3433 + 5pp
5 s2936 HU_8 2982 4597 + 5pp
6 s2999 Pk_4 13820 15014 - 5pp
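If the unpaired first rows should be dropped so that the result matches the expected output in the question, one possible variation is to flag each row that forms a qualifying pair with either its previous or its next row. This is only a sketch under two assumptions: overlaps (negative gaps) count as "less than 100", which the CT_5/G_32 pair in the expected output suggests, and the helper columns pair_with_prev and pair_with_next are names invented here.

```r
library(dplyr)

df <- data.frame(
  S_id = c("s13261", "s13261", "s13082", "s13082", "s2936", "s2936",
           "s2999", "s2999", "s2999", "s2999", "s2999"),
  T_id = c("A_3", "BC_2", "CT_5", "G_32", "HU_8", "HU_9",
           "Pk_4", "Op_12", "WQ_54", "MN_23", "NB_1"),
  Start = c(17947, 18405, 87, 1220, 2982, 2982,
            13820, 32320, 38734, 38741, 44031),
  End = c(18363, 19966, 1259, 3433, 4597, 4073,
          15014, 33618, 40603, 40603, 44339),
  Plus_minus = c("-", "-", "+", "+", "+", "+", "-", "+", "-", "-", "+"),
  status = c("5pp", "3pp", "3pp", "5pp", "5pp", "5pp",
             "5pp", "5pp", "3pp", "3pp", "5pp"),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(S_id) %>%
  mutate(
    # TRUE when this row and the row before it form a qualifying pair;
    # the first row of a group has no predecessor and is therefore FALSE
    pair_with_prev = !is.na(lag(End)) &
      (Start - lag(End)) < 100 &
      status != lag(status),
    # TRUE when this row and the row after it form a qualifying pair
    pair_with_next = lead(pair_with_prev, default = FALSE)
  ) %>%
  filter(pair_with_prev | pair_with_next) %>%
  ungroup() %>%
  select(-pair_with_prev, -pair_with_next)

print(result)
# keeps A_3, BC_2, CT_5 and G_32
```

Since status only takes the values 5pp and 3pp in this data, testing that the two statuses differ is equivalent to requiring one of each. If overlapping rows should be excluded instead, the gap condition could be tightened to also require a non-negative difference.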