Iterate through a data frame in R and keep rows

Time: 2016-11-03 13:38:46

Tags: r dataframe

I am new to R. I have a data frame like the following:

df <- data.frame(
  S_id = c("s13261","s13261","s13082","s13082","s2936","s2936","s2999","s2999","s2999","s2999","s2999"),
  T_id = c("A_3","BC_2","CT_5","G_32","HU_8","HU_9","Pk_4","Op_12","WQ_54","MN_23","NB_1"),
  Start = c(17947,18405,87,1220,2982,2982,13820,32320,38734,38741,44031),
  End = c(18363,19966,1259,3433,4597,4073,15014,33618,40603,40603,44339),
  Plus_minus = c("-","-","+","+","+","+","-","+","-","-","+"),
  status = c("5pp","3pp","3pp","5pp","5pp","5pp","5pp","5pp","3pp","3pp","5pp")
)

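I want to iterate through the data frame and group the rows by S_id. Then, within each group, I want to compare every row with the row that follows it and keep only the rows where the difference between the second row's Start value and the previous row's End value is less than 100, and where the two statuses are either 5pp and 3pp, or 3pp and 5pp. My expected output is pasted below: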

S_id   T_id   Start   End   Plus_minus  status
s13261  A_3  17947   18363        -      5pp
s13261  BC_2 18405   19966        -      3pp
s13082  CT_5  87     1259         +      3pp
s13082  G_32  1220   3433         +      5pp

Please guide me.

1 answer:

Answer 0: (score: 1)

You can use the wonderful dplyr package.

library(dplyr)

df %>%
  group_by(S_id) %>%
  mutate(diff = Start - lag(End), diff_status = lag(status))
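As a quick illustration of the lag step in isolation, here is a minimal sketch on a plain vector. The name `end_values` is made up for this example, and `dplyr::lag` is written with its namespace because base R also ships a different `stats::lag`.

```r
library(dplyr)

# Hypothetical vector standing in for the End column of one group
end_values <- c(18363, 19966, 1259)

# dplyr::lag shifts the vector by one position; the first element has no
# predecessor, so it becomes NA
shifted <- dplyr::lag(end_values)
print(shifted)
```

Inside `group_by(S_id)`, the shift restarts in every group, which is why the first row of each group ends up with NA in the new columns.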

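The lag function accesses the element before the current one. Now the only thing left to do is filter on the newly created columns and decide whether you want to keep NA values and negative values (i.e. overlaps):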

df %>%
  group_by(S_id) %>%
  mutate(diff = Start - lag(End), diff_status = lag(status)) %>%
  filter(diff < 100 | is.na(diff), diff_status != status | is.na(diff_status)) %>%
  select(-diff,-diff_status)

The result for your example is as follows:

    S_id   T_id Start   End Plus_minus status
  <fctr> <fctr> <dbl> <dbl>     <fctr> <fctr>
1 s13261    A_3 17947 18363          -    5pp
2 s13261   BC_2 18405 19966          -    3pp
3 s13082   CT_5    87  1259          +    3pp
4 s13082   G_32  1220  3433          +    5pp
5  s2936   HU_8  2982  4597          +    5pp
6  s2999   Pk_4 13820 15014          -    5pp
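Because rows with an NA difference are kept, this result still contains the first row of the s2936 and s2999 groups, which the expected output in the question does not. If only rows that belong to a qualifying pair should survive, one possible variant is sketched below. It assumes that overlaps (negative differences) count as "less than 100", as in the filter above. The helper columns `pair_with_prev` and `pair_with_next` are names invented for this sketch, and `stringsAsFactors = FALSE` is set so that status is compared as plain text.

```r
library(dplyr)

df <- data.frame(
  S_id = c("s13261","s13261","s13082","s13082","s2936","s2936",
           "s2999","s2999","s2999","s2999","s2999"),
  T_id = c("A_3","BC_2","CT_5","G_32","HU_8","HU_9",
           "Pk_4","Op_12","WQ_54","MN_23","NB_1"),
  Start = c(17947,18405,87,1220,2982,2982,13820,32320,38734,38741,44031),
  End = c(18363,19966,1259,3433,4597,4073,15014,33618,40603,40603,44339),
  Plus_minus = c("-","-","+","+","+","+","-","+","-","-","+"),
  status = c("5pp","3pp","3pp","5pp","5pp","5pp","5pp","5pp","3pp","3pp","5pp"),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(S_id) %>%
  mutate(
    # TRUE when this row and the previous row of the group form a pair:
    # gap below 100 and different status. The first row of a group has no
    # predecessor, so it is FALSE here.
    pair_with_prev = !is.na(dplyr::lag(End)) &
      (Start - dplyr::lag(End)) < 100 &
      status != dplyr::lag(status),
    # TRUE when the next row of the group pairs with this one
    pair_with_next = dplyr::lead(pair_with_prev, default = FALSE)
  ) %>%
  filter(pair_with_prev | pair_with_next) %>%
  ungroup() %>%
  select(-pair_with_prev, -pair_with_next)

print(result)
```

With the example data this keeps the two s13261 rows and the two s13082 rows. Whether the first row of a group should be kept when nothing pairs with it is a design choice, so treat this as one reading of the question rather than the only one.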