Question

I have a df like:

   SampleID Chr Start End    Strand  Value
1:   rep1     1 11001 12000     -     10
2:   rep1     1 15000 20100     -     5
3:   rep2     1 11070 12050     -     1
4:   rep3     1 14950 20090     +     20
...

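I would like to join rows that share the same Chr and Strand and have similar Start and End coordinates (say, within a distance of +/- 100). For the rows that get joined, I would also like to concatenate the SampleID names and the Value. Using the previous example, something like: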

   SampleID Chr Start End    Strand  Value
1:rep1,rep2   1 11001 12000     -     10,1
2:   rep1     1 15000 20100     -     5
4:   rep3     1 14950 20090     +     20
...

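Any ideas? Thanks!

EDIT:

I found the fuzzyjoin package for R (https://cran.r-project.org/web/packages/fuzzyjoin/index.html). Does anyone have experience with this package?

EDIT2:

It would also be fine if only one of the variables (SampleID or Value) were concatenated.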

Answer 1

We can group by 'Chr' and 'Strand', then create a grouping ID from the differences between adjacent elements of the 'Start' and 'End' columns after ordering by 'Start' and 'End'. Then, grouped by 'Chr', 'Strand' and 'ind', we take the first element of 'Start' and 'End', while pasting together the elements of the 'SampleID' and 'Value' columns.

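library(data.table)

# Within each Chr/Strand group, walk the rows in Start/End order and start a new
# group ID whenever Start or End is 100 or more away from the previous row.
df[order(Start, End), ind := cumsum(!(abs(Start - shift(Start, fill = Start[1])) < 100 &
     abs(End - shift(End, fill = End[1])) < 100)) + 1L, by = .(Chr, Strand)
   ][, .(Start = Start[1], End = End[1],   # keep the first Start/End of each group
     SampleID = toString(SampleID), Value = toString(Value)), .(Strand, Chr, ind)]
#     Strand Chr ind Start   End   SampleID Value
#1:      -   1   1 11001 12000 rep1, rep2 10, 1
#2:      -   1   2 15000 20100       rep1     5
#3:      +   1   1 14950 20090       rep3    20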
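
The grouping ID idea can be pictured on a plain sorted vector. The sketch below is only an illustration: the `starts` values and the `tol` threshold are made up, and it looks at a single column rather than both 'Start' and 'End'. Each value is compared with its predecessor, and a running count of the "too far away" flags becomes the ID.

```r
# Hypothetical sorted positions and tolerance (not taken from the question).
starts <- c(11001, 11070, 15000, 15050, 30000)
tol <- 100

# TRUE where a value is at least `tol` away from its predecessor,
# i.e. where a new cluster begins. The first value never opens a new cluster.
new_cluster <- c(FALSE, diff(starts) >= tol)

# A running count of cluster starts gives a 1-based grouping ID.
ind <- cumsum(new_cluster) + 1L
print(ind)
```

Note that this is a chained comparison: a row is only compared with the row just before it, so a long run of closely spaced rows ends up in one group even if its first and last members are more than `tol` apart.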

Note: this assumes 'df' is a data.table.
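
For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the four sample rows from the question as a data.table and wraps the approach in a helper. The helper name `merge_similar_rows` and its `tol` argument are my own, purely for illustration, and the column types are assumed since the question does not state them.

```r
library(data.table)

# Sample rows from the question (column types are an assumption).
df <- data.table(
  SampleID = c("rep1", "rep1", "rep2", "rep3"),
  Chr      = c(1L, 1L, 1L, 1L),
  Start    = c(11001L, 15000L, 11070L, 14950L),
  End      = c(12000L, 20100L, 12050L, 20090L),
  Strand   = c("-", "-", "-", "+"),
  Value    = c(10, 5, 1, 20)
)

# Hypothetical helper: `tol` is the distance at which two neighbouring rows
# stop being treated as similar.
merge_similar_rows <- function(dt, tol = 100) {
  dt <- copy(dt)  # avoid modifying the caller's table by reference
  # Within each Chr/Strand group, walk the rows in Start/End order and open a
  # new cluster whenever Start or End is `tol` or more away from the previous row.
  dt[order(Start, End),
     ind := cumsum(abs(Start - shift(Start, fill = Start[1])) >= tol |
                   abs(End - shift(End, fill = End[1])) >= tol) + 1L,
     by = .(Chr, Strand)]
  # Collapse each cluster to one row: first Start/End in original row order,
  # comma-separated SampleID and Value.
  dt[, .(Start = Start[1], End = End[1],
         SampleID = toString(SampleID), Value = toString(Value)),
     by = .(Chr, Strand, ind)]
}

res <- merge_similar_rows(df)
print(res)
```

On the sample data this collapses the rep1 and rep2 rows on the '-' strand into a single row and leaves the other two rows as they are. If only one of the two columns should be concatenated, drop the other `toString()` call and keep, say, the first element instead.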

Join rows with similar (but not equal) values in a data frame
