Assign groups to rows depending on their positions

Time: 2015-11-16 07:50:59

Tags: r

Here is my data:

test2
   start_first length start_second length_second row_dna   evalue end_first
1       145317     30       153190            30       2 3.33e+08    145347
2       145315     31       153188            31       6 1.23e+08    145346
3       145314     30       153186            32      10 4.47e+07    145344
4       145312     31       153184            33      14 1.60e+07    145343
5       145310     31       153183            32      18 4.47e+07    145341
6       145317     31       262038            33      22 1.60e+07    145348
8       145316     31       262036            34      30 5.67e+06    145347
10      145314     31       262034            37      38 2.36e+05    145345
11      153186     32       178732            33      42 1.60e+07    153218
12      145317     35       178735            30      46 1.99e+06    145352
13      178737     33       245830            38      50 7.99e+04    178770
14      178736     33       245829            37      54 2.36e+05    178769
15      178733     32       245828            34      58 5.67e+06    178765
16      145317     30       178737            32      62 4.47e+07    145347
17      145316     30       178736            32      66 4.47e+07    145346
18      145318     32       221384            33      70 1.60e+07    145350
19      145317     31       221383            32      74 4.47e+07    145348
20      145315     31       221383            30      78 1.23e+08    145346

I want to assign a group to each row, depending on the start_first and start_second columns.

This is my expected output:

test2
   start_first length start_second length_second row_dna   evalue end_first group
1       145317     30       153190            30       2 3.33e+08    145347 1
2       145315     31       153188            31       6 1.23e+08    145346 1
3       145314     30       153186            32      10 4.47e+07    145344 1
4       145312     31       153184            33      14 1.60e+07    145343 1
5       145310     31       153183            32      18 4.47e+07    145341 1
6       145317     31       262038            33      22 1.60e+07    145348 2
8       145316     31       262036            34      30 5.67e+06    145347 2
10      145314     31       262034            37      38 2.36e+05    145345 2
11      153186     32       178732            33      42 1.60e+07    153218 3
12      145317     35       178735            30      46 1.99e+06    145352 3
13      178737     33       245830            38      50 7.99e+04    178770 4
14      178736     33       245829            37      54 2.36e+05    178769 4
15      178733     32       245828            34      58 5.67e+06    178765 4
16      145317     30       178737            32      62 4.47e+07    145347 5
17      145316     30       178736            32      66 4.47e+07    145346 5
18      145318     32       221384            33      70 1.60e+07    145350 6
19      145317     31       221383            32      74 4.47e+07    145348 6
20      145315     31       221383            30      78 1.23e+08    145346 6

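The criterion I use to group the data depends only on position. Looking first at the start_first and start_second columns, the positions in the first five rows are very similar, so these rows are grouped together.

Rows in the same group should be very close to each other (no more than 10 positions apart).

Is there a way to solve this? Thank you for your answers.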

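1 answer:

Answer 0: (score: 0)

If I understand your question correctly, the following may cover, in principle, what you are looking for. However, further adjustment may be needed.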

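# `data` is the data frame from the question (shown there as test2)
distMatrix = as.matrix(dist(cbind(data$start_first, data$start_second))) # Euclidean distances between rows, using both position columns
hist(distMatrix, breaks = 1000) # Inspect the distribution of distances to help pick a threshold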

threshold = 500 # Set a threshold for grouping. Change as desired (the question asks for 10)
distMatrix.bool = distMatrix < threshold # TRUE where two rows are closer than the threshold
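
To make the thresholding step concrete, here is a minimal toy sketch. The three points and the names `pts`, `toy_threshold` and `is_close` are made up for illustration and are not part of the question's data.

```r
# Three made-up points (not from the question): two close together, one far away
pts <- cbind(x = c(0, 3, 100), y = c(0, 4, 100))

# dist() defaults to Euclidean distance; as.matrix() expands the result into a
# full symmetric matrix with zeros on the diagonal
d <- as.matrix(dist(pts))

toy_threshold <- 10            # hypothetical cutoff for this toy example
is_close <- d < toy_threshold  # logical matrix: TRUE where two points are close

print(round(d, 2))
print(is_close)
```

Because the diagonal of the distance matrix is zero, every row counts as close to itself, so `which(distMatrix.bool[i,])` in the loops below always includes row `i`.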


# First version: each group is labeled with the index of its first row
assigned.i = c() # Keep track of assigned rows, to avoid redundancy
data$group2 = -1 # Initialize a new column

for(i in 1:nrow(data)) { # Loop over the rows

  if(i %in% assigned.i) # move to the next loop if already assigned 
    next

  sameGroup = setdiff(which(distMatrix.bool[i,]), assigned.i) # close rows that are not assigned yet
  data$group2[sameGroup] = i

  assigned.i = c(assigned.i, sameGroup)
}



# Second version: same procedure, but groups are numbered consecutively (1, 2, 3, ...)
assigned.i = c() # Keep track of assigned rows, to avoid redundancy
data$group3 = -1 # Initialize a new column

j = 1 # Group counter 

for(i in 1:nrow(data)) { # Loop over the rows

  if(i %in% assigned.i) # move to the next loop if already assigned 
    next

  sameGroup = setdiff(which(distMatrix.bool[i,]), assigned.i) # close rows that are not assigned yet
  data$group3[sameGroup] = j
  j = j + 1

  assigned.i = c(assigned.i, sameGroup)
}
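
The expected output in the question numbers the groups in row order, and rows that are close but separated by other rows (for example rows 12 and 16) end up in different groups. A distance matrix ignores row order, so it may not reproduce that. As an alternative, here is a minimal sketch that compares each row only with the row directly above it. It assumes the 10-position limit from the question; `max_gap`, `new_group` and `group_both` are names introduced here for illustration, and only the two position columns are re-typed from the question (so the row names run 1 to 18 instead of skipping 7 and 9).

```r
# Position columns copied from the question's data frame test2
test2 <- data.frame(
  start_first  = c(145317, 145315, 145314, 145312, 145310, 145317, 145316,
                   145314, 153186, 145317, 178737, 178736, 178733, 145317,
                   145316, 145318, 145317, 145315),
  start_second = c(153190, 153188, 153186, 153184, 153183, 262038, 262036,
                   262034, 178732, 178735, 245830, 245829, 245828, 178737,
                   178736, 221384, 221383, 221383)
)

max_gap <- 10  # "no more than 10 positions", taken from the question

# TRUE where start_second is more than max_gap away from the previous row;
# the first row always starts a group
new_group <- c(TRUE, abs(diff(test2$start_second)) > max_gap)

# Each TRUE opens a new group, so the running count of TRUEs is the group id
test2$group <- cumsum(new_group)

# Stricter variant: also open a new group when start_first jumps
new_group_both <- new_group | c(TRUE, abs(diff(test2$start_first)) > max_gap)
test2$group_both <- cumsum(new_group_both)

print(test2)
```

On this data, checking start_second alone gives the same six groups as the expected output. The stricter variant gives seven groups, because rows 11 and 12 differ by 7869 in start_first and are therefore split, even though the expected output keeps them together.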