Removing duplicates over a subset of columns in R

Time: 2014-03-24 10:13:00

Tags: r

I have a table:

     [,1] [,2] [,3]       [,4]       [,5]
 [1,]    1    5   10 0.00040803 0.00255277
 [2,]    1   11    3 0.01765470 0.01584580
 [3,]    1    6    2 0.15514850 0.15509000
 [4,]    1    8   14 0.02100531 0.02572320
 [5,]    1    9    4 0.04748648 0.00843252
 [6,]    2    5   10 0.00040760 0.06782680
 [7,]    2   11    3 0.01765480 0.01584580
 [8,]    2    6    2 0.15514810 0.15509000
 [9,]    2    8   14 0.02100491 0.02572320
[10,]    2    9    4 0.04748608 0.00843252
[11,]    3    5   10 0.00040760 0.06782680
[12,]    3   11    3 0.01765480 0.01584580
[13,]    3    8   14 0.02100391 0.02572320
[14,]    3    9    4 0.04748508 0.00843252
[15,]    4    5   10 0.00040760 0.06782680
[16,]    4   11    3 0.01765480 0.01584580
[17,]    4    8   14 0.02100391 0.02572320
[18,]    4    9    4 0.04748508 0.00843252
[19,]    5    8   14 0.02100391 0.02572320
[20,]    5    9    4 0.04748508 0.00843252

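I want to remove duplicates from this table. However, only columns 2, 3 and 4 matter. Example: looking only at columns 2, 3 and 4, rows 1, 6, 11 and 15 are identical. A note on column 4: is it possible to treat two values as identical as long as they lie within 10e-5 of each other? Rows 1 and 6 would then count as identical even though their column 4 values differ slightly (within the tolerance I mentioned).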

It would then be great to get output like this:

column 2 value | column 3 value | column 1 value at which the pair was first observed (within the tolerance; 1 in the example) | column 1 value at which the pair was last observed (within the tolerance; 4 in the example) | value of column 4 at first appearance (0.00040803 in the example)

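2 answers:

Answer 0: (score: 0)

Here is one way to think about it, though I am not sure it is exactly what you are looking for. The logic should get you started.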

# dat <- YOUR DATA SET  (placeholder: load your data as a data frame with columns V1..V5)
dat
   V1 V2 V3         V4         V5
1   1  5 10 0.00040803 0.00255277
2   1 11  3 0.01765470 0.01584580
3   1  6  2 0.15514850 0.15509000
4   1  8 14 0.02100531 0.02572320
5   1  9  4 0.04748648 0.00843252
# TRUNCATED

dat <- dat[, c(2, 3, 4)]
dat$V4 <- round(dat$V4, 5)

unique(dat)
  V2 V3      V4
1  5 10 0.00041
2 11  3 0.01765
3  6  2 0.15515
4  8 14 0.02101
5  9  4 0.04749
9  8 14 0.02100
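
One caveat with rounding is visible in the output above: the pair (8, 14) still appears twice, because two values can sit well inside the tolerance and still round to different numbers. A minimal check, using the column 4 values of rows 4 and 9 from the question and assuming a tolerance of 1e-5:

```r
# V4 values of rows 4 and 9 in the question's table
a <- 0.02100531
b <- 0.02100491

# They differ by roughly 4e-7, so they are inside the assumed 1e-5 tolerance
within_tol <- abs(a - b) < 1e-5

# Rounding to 5 decimals puts them on opposite sides of a rounding boundary
same_after_round <- round(a, 5) == round(b, 5)

print(c(within_tol = within_tol, same_after_round = same_after_round))
```

So `round()` followed by `unique()` is a quick first pass, but it is not equivalent to a true tolerance comparison.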

Answer 1: (score: 0)

You could do it like this:

# read your data
yy <- read.csv('your-data.csv', header = FALSE)

##   V1 V2 V3         V4         V5
## 1  1  5 10 0.00040803 0.00255277
## 2  1 11  3 0.01765470 0.01584580
## 3  1  6  2 0.15514850 0.15509000
## 4  1  8 14 0.02100531 0.02572320

# create a logical matrix indicating value is within tolerance
mat.eq.tol <- sapply(yy$V4, function(x) abs(yy$V4-x) < 1E-5)
# minimum index
eq.min <- apply(mat.eq.tol, 1, function(x) min(which(x)))
# maximum index
eq.max <- apply(mat.eq.tol, 1, function(x) max(which(x)))

# combine result
res <- cbind(yy$V2, yy$V3, yy$V1[eq.min], yy$V1[eq.max], yy$V4[eq.min])

##       [,1] [,2] [,3] [,4]       [,5]
## [1,]    5   10    1    4 0.00040803
## [2,]   11    3    1    4 0.01765470
## [3,]    6    2    1    2 0.15514850
## [4,]    8   14    1    5 0.02100531
## [5,]    9    4    1    5 0.04748648
## [6,]    5   10    1    4 0.00040803
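
Note that the code above compares column 4 only and returns one row per input row, so repeated pairs such as row 6 still show up. A possible variant, sketched here under a few assumptions, groups the rows by the (V2, V3) pair first and applies the tolerance to V4 inside each group, giving one row per pair. The data frame is rebuilt from the question's table (V5 is left out because it is not used), the tolerance is taken as 1e-5 as in the code above, and the names `tol`, `groups` and `summarise_pair` are made up for this illustration.

```r
# Data from the question, columns named V1..V4 as in the answers (V5 omitted)
yy <- data.frame(
  V1 = rep(1:5, times = c(5, 5, 4, 4, 2)),
  V2 = c(5, 11, 6, 8, 9,  5, 11, 6, 8, 9,  5, 11, 8, 9,  5, 11, 8, 9,  8, 9),
  V3 = c(10, 3, 2, 14, 4, 10, 3, 2, 14, 4, 10, 3, 14, 4, 10, 3, 14, 4, 14, 4),
  V4 = c(0.00040803, 0.01765470, 0.15514850, 0.02100531, 0.04748648,
         0.00040760, 0.01765480, 0.15514810, 0.02100491, 0.04748608,
         0.00040760, 0.01765480, 0.02100391, 0.04748508,
         0.00040760, 0.01765480, 0.02100391, 0.04748508,
         0.02100391, 0.04748508)
)

tol <- 1e-5  # assumed tolerance

# Row indices for each distinct (V2, V3) pair, ordered by first appearance
groups <- split(seq_len(nrow(yy)), list(yy$V2, yy$V3), drop = TRUE)
groups <- groups[order(vapply(groups, function(i) i[1], integer(1)))]

summarise_pair <- function(idx) {
  first <- idx[1]                                      # first row of this pair
  close <- idx[abs(yy$V4[idx] - yy$V4[first]) < tol]   # rows within tol of it
  data.frame(V2 = yy$V2[first], V3 = yy$V3[first],
             first_V1 = yy$V1[first],                  # V1 at first appearance
             last_V1  = yy$V1[max(close)],             # V1 at last match
             V4_first = yy$V4[first])
}

res <- do.call(rbind, lapply(groups, summarise_pair))
rownames(res) <- NULL
print(res)
```

This sketch compares every row of a pair against the pair's first V4 value only. If a pair could reappear later with a clearly different V4, those rows are simply ignored here, and a proper clustering step would be needed to report them as a separate entry.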