Here is my data:
test2
start_first length start_second length_second row_dna evalue end_first
1 145317 30 153190 30 2 3.33e+08 145347
2 145315 31 153188 31 6 1.23e+08 145346
3 145314 30 153186 32 10 4.47e+07 145344
4 145312 31 153184 33 14 1.60e+07 145343
5 145310 31 153183 32 18 4.47e+07 145341
6 145317 31 262038 33 22 1.60e+07 145348
8 145316 31 262036 34 30 5.67e+06 145347
10 145314 31 262034 37 38 2.36e+05 145345
11 153186 32 178732 33 42 1.60e+07 153218
12 145317 35 178735 30 46 1.99e+06 145352
13 178737 33 245830 38 50 7.99e+04 178770
14 178736 33 245829 37 54 2.36e+05 178769
15 178733 32 245828 34 58 5.67e+06 178765
16 145317 30 178737 32 62 4.47e+07 145347
17 145316 30 178736 32 66 4.47e+07 145346
18 145318 32 221384 33 70 1.60e+07 145350
19 145317 31 221383 32 74 4.47e+07 145348
20 145315 31 221383 30 78 1.23e+08 145346
I want to assign a group to each row, based on the start_first and start_second columns.
This is my expected output:
test2
start_first length start_second length_second row_dna evalue end_first group
1 145317 30 153190 30 2 3.33e+08 145347 1
2 145315 31 153188 31 6 1.23e+08 145346 1
3 145314 30 153186 32 10 4.47e+07 145344 1
4 145312 31 153184 33 14 1.60e+07 145343 1
5 145310 31 153183 32 18 4.47e+07 145341 1
6 145317 31 262038 33 22 1.60e+07 145348 2
8 145316 31 262036 34 30 5.67e+06 145347 2
10 145314 31 262034 37 38 2.36e+05 145345 2
11 153186 32 178732 33 42 1.60e+07 153218 3
12 145317 35 178735 30 46 1.99e+06 145352 3
13 178737 33 245830 38 50 7.99e+04 178770 4
14 178736 33 245829 37 54 2.36e+05 178769 4
15 178733 32 245828 34 58 5.67e+06 178765 4
16 145317 30 178737 32 62 4.47e+07 145347 5
17 145316 30 178736 32 66 4.47e+07 145346 5
18 145318 32 221384 33 70 1.60e+07 145350 6
19 145317 31 221383 32 74 4.47e+07 145348 6
20 145315 31 221383 30 78 1.23e+08 145346 6
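The criterion I use to group the data depends only on position. Looking at the start_first and start_second columns, the positions in rows one through six are very similar, so those rows are grouped together.
Rows in the same group should be very close to each other (no more than 10 positions apart).
Is there a way to solve this? Thanks for your answers.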
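Answer 0 (score: 0)
If I understand your question correctly, the following might cover, in principle, what you are looking for. It may need further tuning, though.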
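# Pairwise Euclidean distances between rows, using both position columns
distMatrix = as.matrix(dist(cbind(data$start_first, data$start_second)))
hist(distMatrix, breaks = 1000) # Inspect the distance distribution to help pick a threshold
threshold = 500 # Set a threshold for grouping. Change as desired
distMatrix.bool = distMatrix < threshold # TRUE where two rows are closer than the threshold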
# Variant 1: label each group with the index of its first row
assigned.i = c() # Keep track of assigned rows, to avoid redundancy
data$group2 = -1 # Initialize a new column
for(i in seq_len(nrow(data))) { # Loop over the rows
  if(i %in% assigned.i) # Move to the next iteration if already assigned
    next
  # Rows within the threshold of row i that have no group yet
  sameGroup = setdiff(which(distMatrix.bool[i,]), assigned.i)
  data$group2[sameGroup] = i
  assigned.i = c(assigned.i, sameGroup)
}
# Variant 2: same grouping, but with consecutive group numbers 1, 2, 3, ...
assigned.i = c() # Keep track of assigned rows, to avoid redundancy
data$group3 = -1 # Initialize a new column
j = 1 # Group counter
for(i in seq_len(nrow(data))) { # Loop over the rows
  if(i %in% assigned.i) # Move to the next iteration if already assigned
    next
  # Rows within the threshold of row i that have no group yet
  sameGroup = setdiff(which(distMatrix.bool[i,]), assigned.i)
  data$group3[sameGroup] = j
  j = j + 1
  assigned.i = c(assigned.i, sameGroup)
}
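
As a possible alternative, here is a minimal sketch of a different technique from the distance-matrix loop: it treats a group as a run of consecutive rows and starts a new group whenever start_second jumps by more than a cutoff. The cutoff name max_gap is my own, its value of 10 is taken from the question's "no more than 10 positions", and the choice to look only at start_second is an assumption. On the sample data this reproduces the expected group column, but it ignores start_first and relies on the row order, so it may not fit other data.

```r
# Sample data from the question; only the two position columns are kept here.
test2 <- data.frame(
  start_first  = c(145317, 145315, 145314, 145312, 145310, 145317, 145316,
                   145314, 153186, 145317, 178737, 178736, 178733, 145317,
                   145316, 145318, 145317, 145315),
  start_second = c(153190, 153188, 153186, 153184, 153183, 262038, 262036,
                   262034, 178732, 178735, 245830, 245829, 245828, 178737,
                   178736, 221384, 221383, 221383)
)

max_gap <- 10  # assumed cutoff (hypothetical name), taken from the question

# TRUE for the first row and wherever start_second differs from the
# previous row by more than max_gap, i.e. where a new group begins
new_group <- c(TRUE, abs(diff(test2$start_second)) > max_gap)

# The running count of group starts gives consecutive group numbers
test2$group <- cumsum(new_group)

print(test2$group)
# [1] 1 1 1 1 1 2 2 2 3 3 4 4 4 5 5 6 6 6
```

Because groups are formed from neighbouring rows only, two runs with similar positions that are separated by other rows (such as the two runs near 178735 in the sample) get different group numbers, which is what the expected output shows.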